Python: split a string on commas that appear only between the two characters > and <

Question:

I have a string:

<div class="options mceEditable">The membrane is a dynamic structure, and its constituents are in constant movement.</div>, <div class="options mceEditable">The lipids component of the membrane constitutes a bilayer of hydrophilic ends</div>, <div class="options mceEditable">The lipid content of the membrane is more than that of the protein</div>, <div class="options mceEditable">The proteins may either be carriers or receptors only</div>, <div class="options mceEditable">It is a 3-layered lipid structure</div>

I want to split the above string on commas, but only on a comma that appears between > and < (that is, in "</div>, <div") and nowhere else.

The expected output:

['<div class="options mceEditable">The membrane is a dynamic structure, and its constituents are in constant movement.</div>',
 '<div class="options mceEditable">The lipids component of the membrane constitutes a bilayer of hydrophilic ends</div>',
 '<div class="options mceEditable">The lipid content of the membrane is more than that of the protein</div>',
 '<div class="options mceEditable">The proteins may either be carriers or receptors only</div>',
 '<div class="options mceEditable">It is a 3-layered lipid structure</div>']

What I tried:

options = test3.split(">, <")
options=options.replace("</div'","</div>'")

Neither of the two attempts above produced the expected result.
Can someone help, please?

Asked By: Shiva

||

Answers:

You can use BeautifulSoup:

# pip install bs4
import bs4

# s is the string from the question; naming the parser avoids a bs4 warning
soup = bs4.BeautifulSoup(s, 'html.parser')
divs = [str(div) for div in soup.find_all('div')]

Output:

>>> divs
['<div class="options mceEditable">The membrane is a dynamic structure, and its constituents are in constant movement.</div>',
 '<div class="options mceEditable">The lipids component of the membrane constitutes a bilayer of hydrophilic ends</div>',
 '<div class="options mceEditable">The lipid content of the membrane is more than that of the protein</div>',
 '<div class="options mceEditable">The proteins may either be carriers or receptors only</div>',
 '<div class="options mceEditable">It is a 3-layered lipid structure</div>']
Answered By: Corralien

Normally I wouldn’t advise regexes on anything related to XML/HTML, but since your input is some processed form of those and no longer valid, I’d say it is acceptable to use regexes in this case, if you can’t fix it at the source of data:

import re

s = '<div class="options mceEditable">The membrane is a dynamic structure, and its constituents are in constant movement.</div>, <div class="options mceEditable">The lipids component of the membrane constitutes a bilayer of hydrophilic ends</div>, <div class="options mceEditable">The lipid content of the membrane is more than that of the protein</div>, <div class="options mceEditable">The proteins may either be carriers or receptors only</div>, <div class="options mceEditable">It is a 3-layered lipid structure</div>'  

pattern = r'<div class="options mceEditable">.*?</div>'

matches = re.findall(pattern, s, re.U)
for m in matches:
    print(m)

Output:

<div class="options mceEditable">The membrane is a dynamic structure, and its constituents are in constant movement.</div>
<div class="options mceEditable">The lipids component of the membrane constitutes a bilayer of hydrophilic ends</div>
<div class="options mceEditable">The lipid content of the membrane is more than that of the protein</div>
<div class="options mceEditable">The proteins may either be carriers or receptors only</div>
<div class="options mceEditable">It is a 3-layered lipid structure</div>
Answered By: matszwecja
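
If you want an actual split rather than a findall, a lookaround-based re.split is one possible sketch. It assumes every separator looks like "</div>," followed by optional whitespace and then "<div", as in the question; the sample string below is a shortened stand-in for the real one.

```python
import re

# Shortened sample in the same shape as the string from the question.
s = (
    '<div class="options mceEditable">The membrane is a dynamic structure, '
    'and its constituents are in constant movement.</div>, '
    '<div class="options mceEditable">It is a 3-layered lipid structure</div>'
)

# Split only on a comma that sits directly after "</div>" and before "<div".
# The lookbehind and lookahead are zero-width, so both tags stay in the pieces.
parts = re.split(r'(?<=</div>),\s*(?=<div)', s)

for part in parts:
    print(part)
```

The comma inside the first sentence is left alone because it is not preceded by "</div>". If the separators in your data vary beyond this shape, the pattern would need adjusting.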
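
You can also do it with plain string methods:

# text holds the original string from the question
count = text.count("</div>")
# split on the full separator so later pieces have no leading space
text = text.split("</div>, ")
m = 1

for i in text:
    if m < count:
        # split() removed the separator from this piece, so print it back
        print(i, end="</div>," + '\n')
        m = m + 1
    else:
        # the last piece still ends with its own </div>
        print(i, end='\n')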