Multiple H3 tags: selecting only a specific one when web scraping
Question:
How do I target the text within a specific H3 tag when there are multiple H3 tags?
I’m currently trying the code below, but it only returns the first H3 tag, with the string "1", instead of the second one with the string "De eerlijke vinder", which is the one I need.
Below are the Python code and part of the HTML I’m scraping.
data = []
books = soup.find_all('section', class_='yf-work')
for book in books:
    rank = book.find('i', class_='yf-checked fa fa-check-square-o').text.strip() if book.find('i', class_='yf-checked fa fa-check-square-o') else None
    title = book.find('h3').text.strip() if book.find_all('h3') else None
    author = book.h4.text.strip()
    title2 = book.select("div.yf-anchor anchor h3").text.strip() if book.select("div.yf-anchor anchor h3") else None
    genre = book.find('a', class_='btn btn4 yf-genre').text.strip() if book.find('a', class_='btn btn4 yf-genre') else None
<section class="yf-work" data-id="453371" data-show="true">
<div class="row-fluid">
<div class="colspan-1 xs-colspan-2">
<span class="work-stats work-stats-pink yf-check">
<h3>
<i class="yf-unchecked fa fa-square-o" style="display: none"></i>
<i class="yf-checked fa fa-check-square-o" style="display: none"></i>
1
</h3>
</span>
</div>
<div class="colspan-3 xs-colspan-2">
</div>
<div class="colspan-8 xs-colspan-8">
<div class="yf-anchor anchor" data-url="https://www.hebban.nl/boek/de-eerlijke-vinder-lize-spit">
<h3>
De eerlijke vinder<br>
</h3>
<h4>Lize Spit </h4>
<br>
Update
This concerns the title of the first highlighted book on this webpage, which has a different HTML structure than the rest of the table: https://www.hebban.nl/rank
The title of the book is "De eerlijke vinder"
Code:
### set user-agent ###
#
response = requests.get(url,headers={'user-agent':'Mozilla/5.0'})
### Parse the HTML content using Beautiful Soup ###
soup = BeautifulSoup(response.content, 'html.parser')
### get rank, book title, authors and genre ###
data = []
books = soup.find_all('section', class_='yf-work')
for book in books:
    rank = "1"
    title = book.select('h3.yf-anchor anchor').text.strip() if book.select('h3.yf-anchor anchor') else None
    author = book.h4.text.strip()
    title2 = book.select("div.yf-anchor anchor h3").text.strip() if book.select("div.yf-anchor anchor h3") else None
    title3 = book.text.strip('.yf-anchor anchor h3')
    genre = book.find('a', class_='btn btn4 yf-genre').text.strip() if book.find('a', class_='btn btn4 yf-genre') else None
    ### create dataframe ###
    data.append({'rank': rank, 'author': author, 'title': title, 'title2': title2, 'title3': title3, 'genres': genre, 'scraped_date': pd.Timestamp.today().strftime('%Y-%m-%d')})
df = pd.DataFrame(data)
print(df)
findme = soup.find_all('div', class_='yf-anchor anchor')
for title in findme:
    second_h3 = title.h3.text.strip()
    print(second_h3)
Answers:
One way to get <h3> De eerlijke vinder<br> </h3> is by using a CSS selector, specifically one with a descendant combinator.
The selector .yf-anchor.anchor h3 selects all h3 elements inside any element that has both the classes yf-anchor and anchor. In this case it selects only <h3> De eerlijke vinder<br> </h3>, as there is only one such element.
Because there is only one such element, I passed .yf-anchor.anchor h3 to the select_one method, so that a single Tag object is returned rather than a list.
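To illustrate why the selector from the question matches nothing while the dotted form works, here is a minimal sketch. It assumes a trimmed-down copy of the question's HTML with the closing tags added; the variable names broken and fixed are only for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the HTML from the question (closing tags added).
html = """
<section class="yf-work">
  <span class="work-stats yf-check"><h3>1</h3></span>
  <div class="yf-anchor anchor">
    <h3>De eerlijke vinder<br></h3>
    <h4>Lize Spit </h4>
  </div>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

# After a space, "anchor" is read as an element name (<anchor>),
# not as a second class, so this selector finds nothing.
broken = soup.select("div.yf-anchor anchor h3")

# Chaining the classes with dots requires both on the same element.
fixed = soup.select(".yf-anchor.anchor h3")

print(len(broken))                    # 0
print(len(fixed))                     # 1
print(fixed[0].get_text(strip=True))  # De eerlijke vinder
```

Note that select always returns a list, so calling .text on its result directly (as in the question) would fail even with a correct selector. The full example below applies the same selector to the HTML from the question.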
from bs4 import BeautifulSoup
html = '''<section class="yf-work" data-id="453371" data-show="true">
<div class="row-fluid">
<div class="colspan-1 xs-colspan-2">
<span class="work-stats work-stats-pink yf-check">
<h3>
<i class="yf-unchecked fa fa-square-o" style="display: none"></i>
<i class="yf-checked fa fa-check-square-o" style="display: none"></i>
1
</h3>
</span>
</div>
<div class="colspan-3 xs-colspan-2">
</div>
<div class="colspan-8 xs-colspan-8">
<div class="yf-anchor anchor" data-url="https://www.hebban.nl/boek/de-eerlijke-vinder-lize-spit">
<h3>
De eerlijke vinder<br>
</h3>
<h4>Lize Spit </h4>
<br>
'''
soup = BeautifulSoup(html, 'html.parser')
books = soup.find_all('section', class_='yf-work')
for book in books:
    second_h3 = book.select_one('.yf-anchor.anchor h3')
    print(second_h3)
Output:
<h3>
De eerlijke vinder<br/>
</h3>
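If only the text is needed rather than the whole tag, the same selector can be combined with get_text. The following is a rough sketch of how that might fit into the loop from the question. The helper text_or_none is hypothetical, and the selector .yf-check h3 for the rank is an assumption based only on the HTML snippet shown, so it may not hold for every row on the live page.

```python
from bs4 import BeautifulSoup

# HTML snippet from the question, with closing tags added for clarity.
html = """
<section class="yf-work" data-id="453371" data-show="true">
  <span class="work-stats work-stats-pink yf-check">
    <h3>
      <i class="yf-unchecked fa fa-square-o" style="display: none"></i>
      <i class="yf-checked fa fa-check-square-o" style="display: none"></i>
      1
    </h3>
  </span>
  <div class="yf-anchor anchor">
    <h3>
      De eerlijke vinder<br>
    </h3>
    <h4>Lize Spit </h4>
  </div>
</section>
"""


def text_or_none(parent, selector):
    """Hypothetical helper: stripped text of the first match, or None."""
    tag = parent.select_one(selector)
    return tag.get_text(strip=True) if tag else None


soup = BeautifulSoup(html, "html.parser")
data = []
for book in soup.find_all("section", class_="yf-work"):
    data.append({
        "rank": text_or_none(book, ".yf-check h3"),  # assumed location of the rank
        "title": text_or_none(book, ".yf-anchor.anchor h3"),
        "author": text_or_none(book, ".yf-anchor.anchor h4"),
    })

print(data)
# [{'rank': '1', 'title': 'De eerlijke vinder', 'author': 'Lize Spit'}]
```

Returning None for a missing element avoids repeating each lookup in a conditional expression, and the resulting list of dicts can be passed to pd.DataFrame as in the question.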