BeautifulSoup – how to get text from <span> elements
Question:
I'm trying to scrape a website. All is going fine, but I want to get the text inside each <span>. I can retrieve the first one, but I can't get to the next ones.
This is the html excerpt:
<ul class="product-small-specs" data-test="product-specs">
<li>
<span>Engels</span>
</li>
<li>
<span>Hardcover</span>
</li>
<li>
<span>9780141395838</span>
</li>
<li>
<span>Druk: New ed</span>
</li>
<li>
<span>oktober 2014</span>
</li>
<li>
<span>352 pagina's</span>
</li>
</ul>
When I try this:
xxx.span.text
I get 'Engels' (which is OK).
But how do I get the text from the next <span> elements?
xxx.span.next_sibling
gives '\n'.
Any help would be highly appreciated.
edit:
The url is this
rec_all = soup.find_all("ul", class_="product-small-specs")
rec = soup.find("ul", class_="product-small-specs")
for iets in rec_all:
    for a in iets:
        print(a.span.text)
        print(a.span.next_sibling)
Answers:
You can use find_all("span") to get a list of all <span> elements, and then use a for-loop to get the text from every item in the list.
from bs4 import BeautifulSoup as BS
text = '''<ul class="product-small-specs" data-test="product-specs">
<li>
<span>Engels</span>
</li>
<li>
<span>Hardcover</span>
</li>
<li>
<span>9780141395838</span>
</li>
<li>
<span>Druk: New ed</span>
</li>
<li>
<span>oktober 2014</span>
</li>
<li>
<span>352 pagina's</span>
</li>
</ul>'''
soup = BS(text, 'html.parser')
all_items = soup.find_all('span')
for item in all_items:
    print(item.text)
Result:
Engels
Hardcover
9780141395838
Druk: New ed
oktober 2014
352 pagina's
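As for why xxx.span.next_sibling gives '\n': the line break between </span> and </li> is itself a text node in the tree, so it is the span's next sibling. Here is a minimal sketch, assuming xxx is the <ul> element (the question does not say what it is) and using a shortened copy of the HTML. find_next() is one way to jump to the following tag instead.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question (first two items only).
html = '''<ul class="product-small-specs" data-test="product-specs">
<li>
<span>Engels</span>
</li>
<li>
<span>Hardcover</span>
</li>
</ul>'''

soup = BeautifulSoup(html, 'html.parser')

# Assumption: `xxx` in the question is the <ul>; .span gives its first <span>.
first_span = soup.find('ul', class_='product-small-specs').span

# The node right after </span> is the line break before </li>, not a tag.
sibling = first_span.next_sibling
print(repr(sibling))

# find_next() walks forward through the document and returns the next matching tag.
second_span = first_span.find_next('span')
print(second_span.text)
```

This only explains the behaviour. For getting all the values, find_all() is still the simpler choice.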
EDIT:
If you need only the <span> elements inside a selected <ul>, you can use
ul = soup.find('ul', class_="product-small-specs")
all_items = ul.find_all('span')  # search only inside `ul`
for item in all_items:
    print(item.text)
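The same restriction to one <ul> can probably be written with a CSS selector as well. This is only a sketch of an alternative, not something taken from the original page: select() takes the selector, and get_text(strip=True) removes surrounding whitespace. The HTML is a shortened copy of the excerpt.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question (first three items only).
html = '''<ul class="product-small-specs" data-test="product-specs">
<li>
<span>Engels</span>
</li>
<li>
<span>Hardcover</span>
</li>
<li>
<span>9780141395838</span>
</li>
</ul>'''

soup = BeautifulSoup(html, 'html.parser')

# "ul.product-small-specs span" matches every <span> nested inside that <ul>.
specs = [span.get_text(strip=True)
         for span in soup.select('ul.product-small-specs span')]
print(specs)
```

A list like this is easier to work with afterwards than printed lines, for example when saving the values to a file.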
EDIT:
If you have more than one <ul>, and more than one <span> in each <li>, you can use nested for-loops
soup = BS(text, 'html.parser')
for ul in soup.find_all("ul", class_="product-small-specs"):
    print('--- ul ---')
    for li in ul.find_all('li'):
        print(' --- li ---')
        for span in li.find_all('span'):
            print(' span:', span.text)
Result:
--- ul ---
 --- li ---
 span: Engels
 --- li ---
 span: Hardcover
 --- li ---
 span: 9780141395838
 --- li ---
 span: Druk: New ed
 --- li ---
 span: oktober 2014
 --- li ---
 span: 352 pagina's
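If the page has several of these lists (for example one per product), it may be more useful to collect the values than to print them. Below is a rough sketch under that assumption. The second <ul> in the sample is made up for illustration and is not from the original page, and `products` is just a name chosen here.

```python
from bs4 import BeautifulSoup

# Made-up sample with two lists; the second one is NOT from the original page.
html = '''<ul class="product-small-specs">
<li><span>Engels</span></li>
<li><span>Hardcover</span></li>
</ul>
<ul class="product-small-specs">
<li><span>Nederlands</span></li>
<li><span>Paperback</span></li>
</ul>'''

soup = BeautifulSoup(html, 'html.parser')

# Build one list of span texts per <ul>.
products = []
for ul in soup.find_all('ul', class_='product-small-specs'):
    specs = [span.get_text(strip=True) for span in ul.find_all('span')]
    products.append(specs)

print(products)
```

Each inner list then holds the specs of one <ul>, in the order they appear on the page.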