Python BeautifulSoup: Extract grouped text after <br> tags
Question:
I’m trying to parse an HTML file into groups of text items and write them to a CSV using BeautifulSoup, but I’m unsure how to parse the pattern. I am new to Python and Beautiful Soup.
The html file looks kinda like this:
<html>
<body>
<br>
<br>
<b>Group 1 title</b>
<br>
<pre> Group 1 description which may or may not be here</pre>
<br>
Group 1 property: Blah blah blah
<br>
<br>
<b>Group 2 title</b>
<br>
Group 2 property: Blah blah blah
</body>
</html>
Essentially I need to parse a massive HTML file which is grouped a certain way, and I’m trying to get every group (its title, a possible description, and its property) into a CSV. I’m just stuck on how to parse the text after what seem to be two <br> tags, and on how to actually differentiate the groups.
Please let me know of possible approaches.
Answers:
You can group the text by the group title found inside the <b> tag. For example:
from bs4 import BeautifulSoup
html = """
<html>
<body>
<br>
<br>
<b>Group 1 title</b>
<br>
<pre> Group 1 description which may or may not be here</pre>
<br>
Group 1 property: Blah blah blah
<br>
<br>
<b>Group 2 title</b>
<br>
Group 2 property: Blah blah blah
</body>
</html>
"""
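soup = BeautifulSoup(html, "html.parser")
out = {}
# string=True matches every text node; older bs4 releases spell it text=True
for t in soup.body.find_all(string=True):
    # the nearest <b> before this text node is the title of its group
    prev_b = t.find_previous("b")
    if not prev_b:
        continue  # text before the first title belongs to no group
    if t.find_parent("b"):
        continue  # skip the title text itself
    t = t.strip()
    if t:
        out.setdefault(prev_b.text, []).append(t)
print(out)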
Prints:
{
"Group 1 title": [
"Group 1 description which may or may not be here",
"Group 1 property: Blah blah blah",
],
"Group 2 title": ["Group 2 property: Blah blah blah"],
}
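Since the goal is a CSV, here is a minimal sketch of one way to write that dictionary out with the standard csv module. The helper name write_groups_csv, the column names title and text, and the one-row-per-text-item layout are all assumptions for illustration, not something the question specifies.

```python
import csv
import io


def write_groups_csv(groups, fileobj):
    """Write one CSV row per (title, text) pair (hypothetical helper)."""
    writer = csv.writer(fileobj)
    writer.writerow(["title", "text"])  # column names are an assumption
    for title, texts in groups.items():
        for text in texts:
            writer.writerow([title, text])


# Same shape as the dictionary printed above.
out = {
    "Group 1 title": [
        "Group 1 description which may or may not be here",
        "Group 1 property: Blah blah blah",
    ],
    "Group 2 title": ["Group 2 property: Blah blah blah"],
}

# StringIO stands in for a real file here; for a file on disk, use
# open("groups.csv", "w", newline="") as the csv docs recommend.
buffer = io.StringIO()
write_groups_csv(out, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

If the description needs its own column, one option would be to check t.find_parent("pre") inside the parsing loop and store that text separately, assuming descriptions always sit inside a <pre> tag as in the sample.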