Python BeautifulSoup: Extract grouped text after <br> tags

Question:

I’m trying to parse an HTML file into groups of text items and write them to a CSV using BeautifulSoup, but I’m unsure how to parse the pattern. I am new to Python and Beautiful Soup.

The html file looks kinda like this:

<html>
 <body>
  
  <br>
  <br>
   <b>Group 1 title</b>
  <br>
  <pre> Group 1 description which may or may not be here</pre>
  <br>
  Group 1 property: Blah blah blah

  <br>
  <br>
   <b>Group 2 title</b>
  <br>
  Group 2 property: Blah blah blah

 </body>
</html>

Essentially, I need to parse a massive HTML file that is grouped in a certain way, and I’m trying to extract each group’s title, optional description, and group property into a CSV. I’m stuck on how to parse the text that follows what seem to be two <br> tags while still being able to tell the groups apart.

Please let me know of any possible approaches.

Asked By: Steven


Answers:

You can group the text by the group title found inside the <b> tag. For example:

from bs4 import BeautifulSoup


html = """
<html>
 <body>
  
  <br>
  <br>
   <b>Group 1 title</b>
  <br>
  <pre> Group 1 description which may or may not be here</pre>
  <br>
  Group 1 property: Blah blah blah

  <br>
  <br>
   <b>Group 2 title</b>
  <br>
  Group 2 property: Blah blah blah

 </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")


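# Map each group title to the list of text fragments that follow it.
out = {}
for t in soup.body.find_all(string=True):
    # The nearest preceding <b> holds the title of the current group.
    prev_b = t.find_previous("b")

    # Skip text that appears before the first group title.
    if not prev_b:
        continue

    # Skip the title text itself.
    if t.find_parent("b"):
        continue

    t = t.strip()
    if t:
        out.setdefault(prev_b.text, []).append(t)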

print(out)

Prints:

{
    "Group 1 title": [
        "Group 1 description which may or may not be here",
        "Group 1 property: Blah blah blah",
    ],
    "Group 2 title": ["Group 2 property: Blah blah blah"],
}
Answered By: Andrej Kesely
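
The question asks for CSV output, which the answer above stops short of. One possible way to get there is the standard-library csv module. The following is only a minimal sketch under stated assumptions: the helper names groups_to_rows and write_groups_csv and the three-column layout are hypothetical, and it guesses that any entry containing ": " is a property while everything else is the optional description. Real data may need a different rule.

```python
import csv
import io


def groups_to_rows(groups):
    """Turn {title: [text, ...]} into [title, description, properties] rows."""
    rows = []
    for title, items in groups.items():
        # Assumption: property lines look like "name: value"; any other
        # entry is treated as the optional description.
        properties = [item for item in items if ": " in item]
        descriptions = [item for item in items if ": " not in item]
        rows.append([title, " ".join(descriptions), "; ".join(properties)])
    return rows


def write_groups_csv(groups, fileobj):
    """Write a header line plus one CSV row per group to an open text file."""
    writer = csv.writer(fileobj)
    writer.writerow(["title", "description", "properties"])
    writer.writerows(groups_to_rows(groups))


# Same structure as the dictionary printed by the answer above.
out = {
    "Group 1 title": [
        "Group 1 description which may or may not be here",
        "Group 1 property: Blah blah blah",
    ],
    "Group 2 title": ["Group 2 property: Blah blah blah"],
}

# An in-memory buffer keeps the sketch self-contained.
buffer = io.StringIO()
write_groups_csv(out, buffer)
print(buffer.getvalue())
```

To write a real file instead of the in-memory buffer, pass a file opened with open("groups.csv", "w", newline="") (the file name is just a placeholder). A group without a description ends up with an empty second column.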