Get all text data from multiple elements with the same class name with Selenium in Python

Question:

I am trying to build a web scraper in Python using Selenium. I would like to get the text inside each h3 tag, as well as the link (href) from the "a" tag that follows it. The basic structure of the page is below.

<div class="class_name">
        <h3>
             <a href="link that I do NOT want">Text That I want</a>
        </h3>
        <a href="Link that I want"></a>
</div>
<div class="class_name">
        <h3>
             <a href="link that I do NOT want">More text that I want</a>
        </h3>
        <a href="another link that I want"></a>
</div>

How would I go about doing this? I’ve looked at XPath solutions as well as using

find_elements(By.CLASS_NAME, "class_name")

but I can’t seem to get anything to work. I was thinking of locating every element with that class and iterating over them one at a time, but I have no clue how to do that. Any help is appreciated!

Asked By: eth


Answers:

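I managed to pull out the second link in each div through .contents. Note that d.a['href'] returns the link inside <h3>, not the one you want: d.a is shorthand for d.find('a'), which returns the first <a> found anywhere inside the div, and that is the one nested in <h3>.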

from bs4 import BeautifulSoup

data = """

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        <title>Web page example</title>
    </head>
    <body>
        <h1>Caption</h1>
        <!-- Comment -->
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
        <div class="class_name">
            <h3>
                <a href="link that I do NOT want">Text That I want</a>
            </h3>
            <a href="Link that I want"></a>
        </div>
        <div class="class_name">
            <h3>
                <a href="link that I do NOT want">More text that I want</a>
            </h3>
            <a href="another link that I want"></a>
        </div>
    </body>
</html>
"""


soup = BeautifulSoup(data, features="lxml")
divs = soup.find_all('div', class_="class_name")

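for d in divs:
    # Text of the <a> nested inside <h3>.
    print(f"text = {d.h3.a.text}")
    # d.a is the first <a> anywhere in the div, i.e. the one inside <h3>.
    print(f"h3 href = {d.a['href']}")
    # contents[3] is the div's direct-child <a>; contents[0] and contents[2]
    # are whitespace text nodes, so this index depends on the exact markup.
    print(f"href = {d.contents[3]['href']}")

Output:

text = Text That I want
h3 href = link that I do NOT want
href = Link that I want
text = More text that I want
h3 href = link that I do NOT want
href = another link that I want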
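
Indexing into .contents works for this exact markup, but the position shifts as soon as the whitespace or the order of the tags changes. A sturdier variant, offered only as a sketch, is to ask Beautiful Soup for the div's direct-child <a> with find(..., recursive=False). The function name extract_pairs is hypothetical, and the built-in "html.parser" is used here instead of lxml purely to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Same structure as in the question.
html = """
<div class="class_name">
    <h3>
        <a href="link that I do NOT want">Text That I want</a>
    </h3>
    <a href="Link that I want"></a>
</div>
<div class="class_name">
    <h3>
        <a href="link that I do NOT want">More text that I want</a>
    </h3>
    <a href="another link that I want"></a>
</div>
"""


def extract_pairs(markup):
    """Return a (heading text, direct-child href) tuple for each div.class_name."""
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for div in soup.find_all("div", class_="class_name"):
        heading = div.h3.a.get_text(strip=True)
        # recursive=False restricts the search to the div's direct children,
        # so the <a> nested inside <h3> is skipped.
        link = div.find("a", recursive=False)["href"]
        pairs.append((heading, link))
    return pairs


pairs = extract_pairs(html)
for heading, link in pairs:
    print(heading, "->", link)
```

Because the lookup is by tag name and nesting level rather than by position, extra whitespace or comments inside the div should not affect the result.
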
Answered By: Сергей Кох