How to select the previous tag when re finds the str
Question:
I have an HTML file like this (more than 100 records):
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">John Smith</h3>
<span class="light-text">Center - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">Jenna Smith</h3>
<span class="light-text">West - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">Jordan Smith</h3>
<span class="light-text">East - VAR - Employee II</span>
</div>
I need to extract the names IF they are Employee I, which makes it challenging. How can I select those tags that have Employee I in the next tag? Or should I use a different method? Is it even possible to use a condition in this case?
import re

with open("file.html", 'r') as input:
    html = input.read()
print(re.search(r'\bEmployee I\b', html).group(0))
Like, how can I tell it to go back and read the previous tag?
Answers:
import re
from bs4 import BeautifulSoup
with open('inputfile.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp.read(), 'html.parser')
names = [span.parent.find('h3').string
         for span in
         soup.find_all('span',
                       class_='light-text',
                       string=re.compile('Employee I$'))
         ]
print(names)
gives
['John Smith', 'Jenna Smith']
I’ve formatted the list comprehension over several lines for clarity, so that it is easier to see where to adjust things for other use cases. Of course, a normal for-loop that appends to a list also works fine; I just like list comprehensions.
The re.compile('Employee I$') is necessary to avoid matching on 'Employee II'. The class_ argument is an extra, and may not be needed.
The rest is near self-explanatory, especially with the BeautifulSoup documentation next to it.
Note that you can use the .text attribute instead of .string, in case you’re using an older version of BeautifulSoup.
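To see why the `$` anchor matters, here is a small standard-library sketch. It assumes the pattern is applied with search semantics to the text of each span, which is what the answer above implies; the sample strings are copied from the question.

```python
import re

# Span texts copied from the question's sample HTML.
span_texts = [
    "Center - VAR - Employee I",
    "West - VAR - Employee I",
    "East - VAR - Employee II",
]

anchored = re.compile(r"Employee I$")   # text must end right after the single "I"
unanchored = re.compile(r"Employee I")  # also matches the start of "Employee II"

anchored_hits = [t for t in span_texts if anchored.search(t)]
unanchored_hits = [t for t in span_texts if unanchored.search(t)]

print(anchored_hits)    # only the two "Employee I" entries
print(unanchored_hits)  # all three entries
```

If the real file has trailing spaces inside the span, a pattern such as `Employee I\s*$` may be needed instead.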
from bs4 import BeautifulSoup
test = '''<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">John Smith</h3>
<span class="light-text">Center - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">Jenna Smith</h3>
<span class="light-text">West - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">Jordan Smith</h3>
<span class="light-text">East - VAR - Employee II</span>
</div>'''
soup = BeautifulSoup(test, 'html.parser')
for person in soup.find_all('div'):
    name = person.find('h3').text
    # The span text looks like "Center - VAR - Employee I"; take the third field.
    employee_nb = person.find('span').text.split('-')[2].strip()
    if employee_nb == "Employee I":
        print(name)
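The `split('-')[2]` step assumes exactly three hyphen-separated fields. As a hedged alternative, the sketch below takes the last field with `rsplit`, so a hyphen inside the region name does not shift the index. The helper name `employee_level` and the "North-East" sample are made up for illustration; they are not part of the original data.

```python
def employee_level(span_text: str) -> str:
    """Return the text after the final hyphen, stripped of whitespace."""
    # rsplit with maxsplit=1 splits only at the last hyphen in the string.
    return span_text.rsplit("-", 1)[-1].strip()


samples = [
    "Center - VAR - Employee I",
    "East - VAR - Employee II",
    "North-East - VAR - Employee I",  # hypothetical region containing a hyphen
]

levels = [employee_level(s) for s in samples]
print(levels)

# For comparison: a fixed index picks the wrong field on the hyphenated region.
fixed_index = samples[2].split("-")[2].strip()
print(fixed_index)
```

Comparing the result with `== "Employee I"` then works the same way as in the loop above.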
You could also use CSS selectors to select your elements more specifically.
As of version 4.7.0, Beautiful Soup supports most CSS4 selectors via the SoupSieve project. If you installed Beautiful Soup through pip, SoupSieve was installed at the same time, so you don’t have to do anything extra.
Example
from bs4 import BeautifulSoup
html = '''
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">John Smith</h3>
<span class="light-text">Center - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">Jenna Smith</h3>
<span class="light-text">West - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">Jordan Smith</h3>
<span class="light-text">East - VAR - Employee II</span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
[e.text for e in soup.select('h3:has(+:-soup-contains("Employee"))')]
Output
['John Smith', 'Jenna Smith', 'Jordan Smith']
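Note that this selector matches every h3 followed by an element containing "Employee", so Jordan Smith (Employee II) is included. Since `:-soup-contains` does substring matching, narrowing it to "Employee I" alone would still match "Employee II". One possible way to get only Employee I is to keep the CSS selector and add a Python check on the sibling's text. This is a sketch, assuming the span directly follows the h3, its text ends with the employee level, and the installed SoupSieve supports `:-soup-contains` as used above.

```python
from bs4 import BeautifulSoup

html = '''
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">John Smith</h3>
<span class="light-text">Center - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">Jenna Smith</h3>
<span class="light-text">West - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">Jordan Smith</h3>
<span class="light-text">East - VAR - Employee II</span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

names = [
    h3.get_text(strip=True)
    # CSS narrows to h3 tags whose next sibling span mentions "Employee I" ...
    for h3 in soup.select('h3:has(+ span:-soup-contains("Employee I"))')
    # ... and this check drops "Employee II", which the substring match lets through.
    if h3.find_next_sibling('span').get_text(strip=True).endswith('Employee I')
]
print(names)
```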