Multiple H3 tags – but only need a specific one with web scraping

Question

How do I target the text within a specific H3 tag when there are multiple H3 tags?

I’m currently trying the code below, but it only returns the first H3 tag, with the string "1", instead of the second one, with the string "De eerlijke vinder", which is the one I need.

Below are the Python code and part of the HTML I’m scraping:

data = []
books = soup.find_all('section', class_='yf-work')
for book in books:
        rank = book.find('i', class_='yf-checked fa fa-check-square-o').text.strip() if book.find('i', class_='yf-checked fa fa-check-square-o') else None
        title = book.find('h3').text.strip() if book.find_all('h3') else None
        author = book.h4.text.strip()
        title2 = book.select("div.yf-anchor anchor h3").text.strip() if book.select("div.yf-anchor anchor h3") else None
        genre = book.find('a', class_='btn btn4 yf-genre').text.strip() if book.find('a', class_='btn btn4 yf-genre') else None

<section class="yf-work" data-id="453371" data-show="true">
                        <div class="row-fluid">

                            <div class="colspan-1 xs-colspan-2">
                                <span class="work-stats work-stats-pink yf-check">
                                    <h3>
                                        <i class="yf-unchecked fa fa-square-o" style="display: none"></i>
                                        <i class="yf-checked fa fa-check-square-o" style="display: none"></i>
                                        1
                                    </h3>
                                </span>
                            </div>

                            <div class="colspan-3  xs-colspan-2">
                                
                            </div>

                            <div class="colspan-8  xs-colspan-8">
                                <div class="yf-anchor anchor" data-url="https://www.hebban.nl/boek/de-eerlijke-vinder-lize-spit">
                                    <h3>
                                        De eerlijke vinder<br>
                                    </h3>

                                    <h4>Lize Spit </h4>

                                    <br>

Update

We’re talking about the book title of the first, highlighted item on this webpage, which has a different HTML section than the rest of the table: https://www.hebban.nl/rank

The title of the book is "De eerlijke vinder"

Code:


import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.hebban.nl/rank'

    ### set user-agent ###
response = requests.get(url,headers={'user-agent':'Mozilla/5.0'})

    ### Parse the HTML content using Beautiful Soup ###
soup = BeautifulSoup(response.content, 'html.parser')

    ### get rank, book title, authors and genre ###
data = []
books = soup.find_all('section', class_='yf-work')
for book in books:
        rank = "1"
        title = book.select('h3.yf-anchor anchor').text.strip() if book.select('h3.yf-anchor anchor') else None
        author = book.h4.text.strip()
        title2 = book.select("div.yf-anchor anchor h3").text.strip() if book.select("div.yf-anchor anchor h3") else None
        title3 = book.text.strip('.yf-anchor anchor h3')
        genre = book.find('a', class_='btn btn4 yf-genre').text.strip() if book.find('a', class_='btn btn4 yf-genre') else None
   

        ### create dataframe ###
        data.append({'rank': rank, 'author': author, 'title': title, 'title2': title2, 'title3': title3, 'genres': genre, 'scraped_date': pd.Timestamp.today().strftime('%Y-%m-%d')})

        df = pd.DataFrame (data)

        print(df)


findme = soup.find_all('div', class_='yf-anchor anchor')
for title in findme:
    second_h3 = title.h3.text.strip()
    print(second_h3)

Asked By: jsb92

Answer 1

One way to get <h3> De eerlijke vinder<br> </h3> is by using a CSS selector, specifically a descendant combinator.

.yf-anchor.anchor h3 selects all h3 elements inside any element with the classes yf-anchor and anchor. So, in this case it would only select <h3> De eerlijke vinder<br> </h3>, as there is only one such element.

Because there is only one such element, I passed .yf-anchor.anchor h3 into the select_one method, so that a single Tag object is returned rather than a list.
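
As a side note on why the selector from the question, div.yf-anchor anchor h3, matches nothing: a space in a CSS selector is a descendant combinator, so that string looks for an <anchor> element nested inside div.yf-anchor. Chaining the class names with dots instead requires both classes on the same element. A minimal sketch of the difference, using a cut-down copy of the HTML (the shortened html string is an assumption made for brevity, not the real page):

```python
from bs4 import BeautifulSoup

# Cut-down copy of the markup from the question (shortened for illustration).
html = '''
<section class="yf-work">
  <span class="yf-check"><h3>1</h3></span>
  <div class="yf-anchor anchor">
    <h3>De eerlijke vinder<br></h3>
    <h4>Lize Spit </h4>
  </div>
</section>
'''

soup = BeautifulSoup(html, 'html.parser')

# The space makes "anchor" a tag name, so this looks for <anchor> elements: no match.
no_match = soup.select('div.yf-anchor anchor h3')

# Dots chain class names on the same element; " h3" then descends into it.
matches = soup.select('.yf-anchor.anchor h3')    # list of Tag objects
first = soup.select_one('.yf-anchor.anchor h3')  # a single Tag, or None if nothing matches

print(len(no_match))               # 0
print(len(matches))                # 1
print(first.get_text(strip=True))  # De eerlijke vinder
```

The full example with the original HTML follows.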

from bs4 import BeautifulSoup

html = '''<section class="yf-work" data-id="453371" data-show="true">
                        <div class="row-fluid">

                            <div class="colspan-1 xs-colspan-2">
                                <span class="work-stats work-stats-pink yf-check">
                                    <h3>
                                        <i class="yf-unchecked fa fa-square-o" style="display: none"></i>
                                        <i class="yf-checked fa fa-check-square-o" style="display: none"></i>
                                        1
                                    </h3>
                                </span>
                            </div>

                            <div class="colspan-3  xs-colspan-2">
                                
                            </div>

                            <div class="colspan-8  xs-colspan-8">
                                <div class="yf-anchor anchor" data-url="https://www.hebban.nl/boek/de-eerlijke-vinder-lize-spit">
                                    <h3>
                                        De eerlijke vinder<br>
                                    </h3>

                                    <h4>Lize Spit </h4>

                                    <br>
'''

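soup = BeautifulSoup(html, 'html.parser')
# Select each book section explicitly rather than iterating over the soup's top-level nodes
books = soup.find_all('section', class_='yf-work')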

for book in books:
    second_h3 = book.select_one('.yf-anchor.anchor h3')
    print(second_h3)

Output:

<h3>
                                        De eerlijke vinder<br/>
</h3>
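
The output above is the whole tag. If only the text is needed, as in the question, one option is to call get_text(strip=True) on the result and fall back to None when a section has no match. A minimal sketch, assuming each book sits in a section with the class yf-work as in the question; the helper name text_or_none, the shortened html string, and the second (empty) section are hypothetical and only there for illustration:

```python
from bs4 import BeautifulSoup

# Shortened sample markup; the second section is made up to show the None fallback.
html = '''
<section class="yf-work">
  <span class="yf-check"><h3>1</h3></span>
  <div class="yf-anchor anchor">
    <h3>De eerlijke vinder<br></h3>
    <h4>Lize Spit </h4>
  </div>
</section>
<section class="yf-work">
  <span class="yf-check"><h3>2</h3></span>
</section>
'''


def text_or_none(parent, selector):
    """Return the stripped text of the first match for selector, or None."""
    tag = parent.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else None


soup = BeautifulSoup(html, 'html.parser')

data = []
for book in soup.select('section.yf-work'):
    data.append({
        'rank': text_or_none(book, '.yf-check h3'),
        'title': text_or_none(book, '.yf-anchor.anchor h3'),
        'author': text_or_none(book, '.yf-anchor.anchor h4'),
    })

print(data)
```

On the real page, response.content would be passed to BeautifulSoup instead of the sample string. The question only shows the markup of the first item, so the selectors may need adjusting for the other rows.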

Answered By: Übermensch
