Extract an area from HTML parsed by BeautifulSoup
Question:
The HTML looks as follows:
<div style="margin-bottom:2px;"><strong>Part1:</strong></div>
<ul>
<li>This is part one.</li>
<li>Please extract me</li>
</ul>
<div style="margin-bottom:2px;"><strong>PartTwo:</strong></div>
<ul>
<li>This is part 2</li>
<li>This has not to be extracted</li>
</ul>
I only want to extract Part1, that is, the two items
"This is part one." and "Please extract me".
I tried matching on the text with soup, but I don't think this is the right approach, as it only extracts "Part1:":
soup = BeautifulSoup(html_document, 'html.parser')
part1 = soup(text=lambda t: "Part1:" in t.text)
part1
And something like the following (nested loops) does not work either, as it also includes PartTwo:
for ul in soup:
    for li in soup.findAll('li'):
        print(li)
So what I want is only the list items that follow the first strong tag, the one with the text "Part1:".
Answers:
How about trying this:
from bs4 import BeautifulSoup
html_sample = """<div style="margin-bottom:2px;"><strong>Part1:</strong></div>
<ul>
<li>This is part one.</li>
<li>Please extract me</li>
</ul>
<div style="margin-bottom:2px;"><strong>PartTwo:</strong></div>
<ul>
<li>This is part 2</li>
<li>This has not to be extracted</li>
</ul>"""
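# select_one returns the first <ul> placed directly after a matching <div>,
# which is the Part1 list (the "lxml" parser needs the lxml package installed)
soup = (
    BeautifulSoup(html_sample, "lxml")
    .select_one("div[style='margin-bottom:2px;'] + ul")
    .select("li")
)
# join with a real newline so each item prints on its own line
print("\n".join([li.getText() for li in soup]))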
Output:
This is part one.
Please extract me
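If you would rather anchor on the "Part1:" label itself than on the div's style attribute, something along these lines might work. This is only a sketch: the helper name items_under_heading is made up for illustration, it uses the built-in "html.parser" so lxml is not needed, and it assumes the strong tag contains exactly the label text and that the list you want is the next ul after it in the document.

```python
from bs4 import BeautifulSoup

html_sample = """<div style="margin-bottom:2px;"><strong>Part1:</strong></div>
<ul>
<li>This is part one.</li>
<li>Please extract me</li>
</ul>
<div style="margin-bottom:2px;"><strong>PartTwo:</strong></div>
<ul>
<li>This is part 2</li>
<li>This has not to be extracted</li>
</ul>"""


def items_under_heading(html, heading):
    """Return the <li> texts of the first <ul> that follows <strong>heading</strong>.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Match the <strong> tag whose text is exactly the heading
    strong = soup.find("strong", string=heading)
    if strong is None:
        return []
    # find_next walks forward in document order to the next <ul>
    ul = strong.find_next("ul")
    if ul is None:
        return []
    return [li.get_text(strip=True) for li in ul.find_all("li")]


part1_items = items_under_heading(html_sample, "Part1:")
print("\n".join(part1_items))
```

With the sample HTML this prints the same two lines as above. The upside is that it keeps working if the style attribute changes; the downside is that it depends on the label text matching exactly.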