Python: split a string by commas that appear only between two specific characters (> and <)
Question:
I have a string:
<div class="options mceEditable">The membrane is a dynamic structure, and its constituents are in constant movement.</div>, <div class="options mceEditable">The lipids component of the membrane constitutes a bilayer of hydrophilic ends</div>, <div class="options mceEditable">The lipid content of the membrane is more than that of the protein</div>, <div class="options mceEditable">The proteins may either be carriers or receptors only</div>, <div class="options mceEditable">It is a 3-layered lipid structure</div>
I want to split the above string on commas, but only where the comma appears between > and < (that is, in "</div>, <div") and nowhere else.
The expected output:
['<div class="options mceEditable">The membrane is a dynamic structure, and its constituents are in constant movement.</div>',
'<div class="options mceEditable">The lipids component of the membrane constitutes a bilayer of hydrophilic ends</div>',
'<div class="options mceEditable">The lipid content of the membrane is more than that of the protein</div>',
'<div class="options mceEditable">The proteins may either be carriers or receptors only</div>',
'<div class="options mceEditable">It is a 3-layered lipid structure</div>']
What I tried:
options = test3.split(">, <")
options=options.replace("</div'","</div>'")
Neither of the above two attempts gave the expected result.
Can someone help please?
Answers:
You can use BeautifulSoup:
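# pip install bs4
import bs4

# s is the string from the question; naming the parser avoids bs4's "no parser specified" warning
soup = bs4.BeautifulSoup(s, "html.parser")
# serialize each <div> element back to its HTML string
divs = [str(div) for div in soup.find_all('div')]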
Output:
>>> divs
['<div class="options mceEditable">The membrane is a dynamic structure, and its constituents are in constant movement.</div>',
'<div class="options mceEditable">The lipids component of the membrane constitutes a bilayer of hydrophilic ends</div>',
'<div class="options mceEditable">The lipid content of the membrane is more than that of the protein</div>',
'<div class="options mceEditable">The proteins may either be carriers or receptors only</div>',
'<div class="options mceEditable">It is a 3-layered lipid structure</div>']
Normally I wouldn’t advise regexes on anything related to XML/HTML, but since your input is some processed form of those and no longer valid, I’d say it is acceptable to use regexes in this case, if you can’t fix it at the source of data:
import re
s = '<div class="options mceEditable">The membrane is a dynamic structure, and its constituents are in constant movement.</div>, <div class="options mceEditable">The lipids component of the membrane constitutes a bilayer of hydrophilic ends</div>, <div class="options mceEditable">The lipid content of the membrane is more than that of the protein</div>, <div class="options mceEditable">The proteins may either be carriers or receptors only</div>, <div class="options mceEditable">It is a 3-layered lipid structure</div>'
pattern = r'<div class="options mceEditable">.*?</div>'
matches = re.findall(pattern, s, re.U)
for m in matches:
    print(m)
Output:
<div class="options mceEditable">The membrane is a dynamic structure, and its constituents are in constant movement.</div>
<div class="options mceEditable">The lipids component of the membrane constitutes a bilayer of hydrophilic ends</div>
<div class="options mceEditable">The lipid content of the membrane is more than that of the protein</div>
<div class="options mceEditable">The proteins may either be carriers or receptors only</div>
<div class="options mceEditable">It is a 3-layered lipid structure</div>
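If you want an actual split rather than a findall, a lookaround-based re.split is one option. This is only a minimal sketch: it assumes every separator is a comma (plus optional whitespace) sitting directly between "</div>" and "<div", and the short sample string below is made up for illustration, not taken from the question.

```python
import re

# Shortened, hypothetical sample in the same shape as the question's string.
s = ('<div class="options mceEditable">A dynamic structure, in constant movement.</div>, '
     '<div class="options mceEditable">It is a 3-layered lipid structure</div>')

# Split on a comma (plus optional whitespace) only when it is preceded by
# "</div>" and followed by "<div"; commas inside the text are left alone.
parts = re.split(r'(?<=</div>),\s*(?=<div)', s)

for p in parts:
    print(p)
```

Because the lookbehind and lookahead consume nothing, both tags stay attached to their pieces and no repair step is needed afterwards.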
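# text is the string from the question
count = text.count("</div>")
# splitting on "</div>," removes the closing tag from every piece but the last
text = text.split("</div>,")
m = 1
for i in text:
    if m < count:
        # put the removed "</div>," back and end the line
        print(i, end="</div>," + '\n')
        m = m + 1
    else:
        # the last piece still has its own "</div>"
        print(i, end='\n')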
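The same plain str.split idea can also return a list, which is what the question asked for, instead of printing. A rough sketch, assuming the separator is always exactly "</div>," with no space before the comma; split_divs is a hypothetical helper name and the sample string is made up:

```python
def split_divs(text, closing="</div>"):
    """Split text on 'closing,' and put the closing tag back on each piece."""
    pieces = text.split(closing + ",")
    # Every piece except the last lost its closing tag in the split.
    result = [p.strip() + closing for p in pieces[:-1]]
    # The last piece was never cut, so it only needs its whitespace trimmed.
    result.append(pieces[-1].strip())
    return result


# Hypothetical short sample; the comma inside the first div is not a separator.
sample = '<div>a, b</div>, <div>c</div>'
options = split_divs(sample)
print(options)
```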