How to extract specific link within a div?

Question:

I have a soup that contains many <div> elements like the ones below. The ones I’m interested in have the class foo.

Each <div> contains many links and other content. I’m interested in the second link (the second <a> </a>); it is always the second one.

I want to grab the value of its href attribute and the text inside that second <a> </a> tag.

For example:

<div class ="foo">
     <a href ="http://example.com"> </a>
     <a href ="http://example2.com"> Title here </a>
</div>

<div class ="foo">
     <a href ="http://example3.com"> </a>
     <a href ="http://example4.com"> Title 2 here </a>
</div>

Here I want to get:

Title here => http://example2.com

Title 2 here => http://example4.com

I’ve tried writing some code:

soup.findAll("div", { "class" : "foo" })

but that returns a list of all the divs and their content, and I don’t know how to go further.

thanks 🙂

Asked By: Merna


Answers:

Iterate over the divs and pick the second <a> inside each one.

from bs4 import BeautifulSoup

example = '''
<div class ="foo">
     <a href ="http://example.com"> </a>
     <a href ="http://example2.com"> Title here </a>
</div>

<div class ="foo">
     <a href ="http://example3.com"> </a>
     <a href ="http://example4.com"> Title 2 here </a>
</div>
'''

soup = BeautifulSoup(example, 'html.parser')  # name the parser explicitly
for div in soup.find_all('div', {'class': 'foo'}):
    a = div.find_all('a')[1]  # the second link inside this div
    print(a.text.strip(), '=>', a.attrs['href'])
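
The loop above assumes every div has at least two links; if one does not, indexing with [1] raises an IndexError. A minimal sketch of a more defensive variant follows. The function name second_links and the one-link sample div are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup


def second_links(html):
    """Return (text, href) for the second <a> in each div.foo (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for div in soup.find_all("div", class_="foo"):
        links = div.find_all("a")
        if len(links) < 2:
            # Skip divs that have no second link instead of raising IndexError.
            continue
        second = links[1]
        results.append((second.get_text(strip=True), second.get("href")))
    return results


# Sample input: the second div is an invented edge case with only one link.
sample = '''
<div class ="foo">
     <a href ="http://example.com"> </a>
     <a href ="http://example2.com"> Title here </a>
</div>

<div class ="foo">
     <a href ="http://example5.com"> only one link </a>
</div>
'''

pairs = second_links(sample)
for title, href in pairs:
    print(title, "=>", href)
```

Whether to skip such divs or report them is a design choice; skipping is used here only to keep the sketch short.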
Answered By: falsetru

UPDATE

Times change and new versions of Beautiful Soup come out.

As of version 4.7.0, Beautiful Soup supports most CSS4 selectors via the SoupSieve project.

So you can alternatively use a CSS selector such as :nth-of-type(2) to select the second link in each div, and build a dict that maps each link's text to its href value:

dict((a.text,a.get('href')) for a in soup.select('div.foo a:nth-of-type(2)'))

Note: In newer code, avoid the old findAll() syntax; use find_all() or select() with CSS selectors instead. For more, take a minute to check the docs.
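
To make the note concrete, here is a small sketch that runs both recommended approaches, find_all() and select(), on the question's markup and compares what they return. The variable names by_find_all and by_select are illustrative only.

```python
from bs4 import BeautifulSoup

html = '''
<div class ="foo">
     <a href ="http://example.com"> </a>
     <a href ="http://example2.com"> Title here </a>
</div>

<div class ="foo">
     <a href ="http://example3.com"> </a>
     <a href ="http://example4.com"> Title 2 here </a>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# find_all(): visit each div.foo and take its second <a> by position.
by_find_all = [
    div.find_all("a")[1].get("href")
    for div in soup.find_all("div", class_="foo")
]

# select(): let the CSS selector pick the second <a> among its siblings.
by_select = [a.get("href") for a in soup.select("div.foo a:nth-of-type(2)")]

print(by_find_all == by_select)
```

The two agree on this markup, but they are not interchangeable in general: find_all('a') collects every descendant link of the div, while :nth-of-type(2) counts only <a> elements that share the same parent.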

Example

from bs4 import BeautifulSoup

html = '''
<div class ="foo">
     <a href ="http://example.com"> </a>
     <a href ="http://example2.com"> Title here </a>
</div>

<div class ="foo">
     <a href ="http://example3.com"> </a>
     <a href ="http://example4.com"> Title 2 here </a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

dict((a.text,a.get('href')) for a in soup.select('div.foo a:nth-of-type(2)'))

Output

{' Title here ': 'http://example2.com',
 ' Title 2 here ': 'http://example4.com'}
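
The keys in the output above keep the whitespace that surrounds the link text. If that is unwanted, one possible variant (a sketch, not part of the original answer) strips it with get_text(strip=True) and uses a child combinator so that only links directly inside div.foo are considered. The name titles is illustrative.

```python
from bs4 import BeautifulSoup

html = '''
<div class ="foo">
     <a href ="http://example.com"> </a>
     <a href ="http://example2.com"> Title here </a>
</div>

<div class ="foo">
     <a href ="http://example3.com"> </a>
     <a href ="http://example4.com"> Title 2 here </a>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# Map stripped link text to href for the second direct-child <a> of each div.foo.
titles = {
    a.get_text(strip=True): a.get("href")
    for a in soup.select("div.foo > a:nth-of-type(2)")
}
print(titles)
```

Because a dict keeps only one value per key, two links with the same text would collapse into a single entry; a list of (text, href) tuples avoids that if duplicates matter.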
Answered By: HedgeHog