Extract area from HTML parsed by BeautifulSoup

Question

The HTML looks as follows:

<div style="margin-bottom:2px;"><strong>Part1:</strong></div>
<ul>
<li>This is part one.</li>
<li>Please extract me</li>
</ul>
&nbsp;
<div style="margin-bottom:2px;"><strong>PartTwo:</strong></div>
<ul>
<li>This is part 2</li>
<li>This has not to be extracted</li>

</ul>

In fact, I just want to extract Part1, that is, the items
"This is part one." AND "Please extract me".
I have tackled the problem with soup and text, but I don't think this is the correct approach, as it only extracts "Part1:":

soup = BeautifulSoup(html_document, 'html.parser')

part1 = soup(text=lambda t: "Part1:" in t.text)
part1

And something like the following (a nested loop) does not work either, as it also includes PartTwo:

for ul in soup:
    for li in soup.findAll('li'):
        print(li)

So in fact I only want to extract the list that follows the first strong tag with the text "Part1:".

Asked By: question12

Answer 1

How about trying this:

from bs4 import BeautifulSoup

html_sample = """<div style="margin-bottom:2px;"><strong>Part1:</strong></div>
<ul>
<li>This is part one.</li>
<li>Please extract me</li>
</ul>
&nbsp;
<div style="margin-bottom:2px;"><strong>PartTwo:</strong></div>
<ul>
<li>This is part 2</li>
<li>This has not to be extracted</li>

</ul>"""

soup = (
    BeautifulSoup(html_sample, "lxml")
    .select_one("div[style='margin-bottom:2px;'] + ul")
    .select("li")
)
print("\n".join([li.getText() for li in soup]))

Output:

This is part one.
Please extract me
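
A possible variation, in case the order of the sections is not guaranteed: the selector above picks the first div with that style, whichever section it happens to be. The sketch below instead looks up the strong tag by its text and then takes the next ul after it. It assumes the heading text is exactly "Part1:" and that the list follows the heading in document order; the names heading, part1_list and items are only illustrative.

```python
from bs4 import BeautifulSoup

html_sample = """<div style="margin-bottom:2px;"><strong>Part1:</strong></div>
<ul>
<li>This is part one.</li>
<li>Please extract me</li>
</ul>
&nbsp;
<div style="margin-bottom:2px;"><strong>PartTwo:</strong></div>
<ul>
<li>This is part 2</li>
<li>This has not to be extracted</li>

</ul>"""

soup = BeautifulSoup(html_sample, "html.parser")

# Find the <strong> tag whose text is exactly "Part1:" (assumed heading text).
heading = soup.find("strong", string="Part1:")

# Take the first <ul> that appears after the heading in document order.
part1_list = heading.find_next("ul")

# Collect the text of each <li> inside that list only.
items = [li.get_text(strip=True) for li in part1_list.find_all("li")]
print("\n".join(items))
```

If the heading might be missing, soup.find returns None, so a check before calling find_next would be needed.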

Answered By: baduker
