How to extract a specific link within a div?
Question:
I have a soup containing many <div> elements; the ones I'm interested in are those with the class foo.
In each <div> there are a lot of links and other content. I'm interested in the second link (the second <a> </a>); it is always the second.
I want to grab the value of its href attribute and the text between the second pair of <a> tags.
For example:
<div class="foo">
<a href="http://example.com"> </a>
<a href="http://example2.com"> Title here </a>
</div>
<div class="foo">
<a href="http://example3.com"> </a>
<a href="http://example4.com"> Title 2 here </a>
</div>
Here is what I want to get:
Title here => http://example2.com
Title 2 here => http://example4.com
I’ve tried writing some code:
soup.findAll("div", { "class" : "foo" })
but that returns a list of all the divs and their content, and I don’t know how to go further.
Thanks 🙂
Answers:
Iterate over the divs and find the a elements inside each one.
from bs4 import BeautifulSoup

example = '''
<div class="foo">
<a href="http://example.com"> </a>
<a href="http://example2.com"> Title here </a>
</div>
<div class="foo">
<a href="http://example3.com"> </a>
<a href="http://example4.com"> Title 2 here </a>
</div>
'''

soup = BeautifulSoup(example, 'html.parser')
for div in soup.findAll('div', {'class': 'foo'}):
    a = div.findAll('a')[1]  # the second link in the div
    print(a.text.strip(), '=>', a.attrs['href'])
UPDATE
Times change and new versions of BeautifulSoup come out.
As of version 4.7.0, Beautiful Soup supports most CSS4 selectors via the SoupSieve project.
So you can alternatively use a CSS selector like :nth-of-type(2) to collect all the expected elements into a dict of text and href values:
{a.text: a.get('href') for a in soup.select('div.foo a:nth-of-type(2)')}
Note: In newer code, avoid the old syntax findAll(); instead use find_all() or select() with CSS selectors. For more, take a minute to check the docs.
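As a sketch of the newer syntax, the loop from the first answer can be rewritten with find_all() (this assumes bs4 is installed and uses Python's built-in html.parser):

```python
from bs4 import BeautifulSoup

html = '''
<div class="foo">
<a href="http://example.com"> </a>
<a href="http://example2.com"> Title here </a>
</div>
<div class="foo">
<a href="http://example3.com"> </a>
<a href="http://example4.com"> Title 2 here </a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
for div in soup.find_all('div', class_='foo'):
    # find_all returns the <a> tags in document order; index 1 is the second link
    second = div.find_all('a')[1]
    print(second.get_text(strip=True), '=>', second['href'])
```

The class_ keyword argument is the modern equivalent of passing {'class': 'foo'}.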
Example
from bs4 import BeautifulSoup

html = '''
<div class="foo">
<a href="http://example.com"> </a>
<a href="http://example2.com"> Title here </a>
</div>
<div class="foo">
<a href="http://example3.com"> </a>
<a href="http://example4.com"> Title 2 here </a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
{a.text: a.get('href') for a in soup.select('div.foo a:nth-of-type(2)')}
Output
{' Title here ': 'http://example2.com',
' Title 2 here ': 'http://example4.com'}
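Note that the keys in that output keep the surrounding spaces from the markup. If that is unwanted, get_text(strip=True) trims them; a small variation on the comprehension above:

```python
from bs4 import BeautifulSoup

html = '''
<div class="foo">
<a href="http://example.com"> </a>
<a href="http://example2.com"> Title here </a>
</div>
<div class="foo">
<a href="http://example3.com"> </a>
<a href="http://example4.com"> Title 2 here </a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# get_text(strip=True) removes the leading/trailing whitespace around each link's text
links = {a.get_text(strip=True): a.get('href')
         for a in soup.select('div.foo a:nth-of-type(2)')}
print(links)
# {'Title here': 'http://example2.com', 'Title 2 here': 'http://example4.com'}
```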