Scraping <span> "text" </span> with BeautifulSoup
Question:
I am working on scraping the data from a website using BeautifulSoup. For whatever reason, I cannot seem to find a way to get the text between span elements to print. Here is what I am running.
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
url = 'https://www.amazon.com/GymCope-Anti-Tear-Cushioning-Non-Slip-Exercise/dp/B0921F1T2P/ref=sr_1_3_sspa?brr=1&pd_rd_r=4b40f0a8-f2d8-44dc-9a98-413c64d3fa34&pd_rd_w=P9ZJI&pd_rd_wg=RS7zW&pf_rd_p=9875e817-188b-48a2-986d-8146749644ac&pf_rd_r=AGWBT5KT04TYKGPZASKA&qid=1642452438&rd=1&rnid=3407731&s=sporting-goods&sr=1-3-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUExVjhWTk0xQU5WWldPJmVuY3J5cHRlZElkPUEwODE0MzYwMTdMTDZSNDVST08yMiZlbmNyeXB0ZWRBZElkPUEwODQ4MDM0MlE4WEtVUjFKMUdLMiZ3aWRnZXROYW1lPXNwX2F0Zl9icm93c2UmYWN0aW9uPWNsaWNrUmVkaXJlY3QmZG9Ob3RMb2dDbGljaz10cnVl'
response = requests.get(url, headers=headers)
html = response.text
soup = BeautifulSoup(html)
bsr = soup.find("div", class_="a-section table-padding").text
and seeing this,
>>> bsr
' ASIN B0921F1T2P Customer Reviews nn n 4.6 out of 5 stars n 41 ratings nnn 4.6 out of 5 stars Best Sellers Rank #69,660 in Sports & Outdoors (See Top 100 in Sports & Outdoors) #234 in Yoga Mats Date First Available April 8, 2021 '
I tried
bsra = soup.find("div", class_="a-section table-padding").find_next('span').get_text()
but it comes out
> > > bsr
> > > '\n 4.6 out of 5 stars '
I want only to scrape "Best Sellers Rank" as in the picture. Thanks.
Answers:
Referenced picture in your question is missing, but you can get rank by selecting your elements more specific:
soup.select_one('th:-soup-contains("Best Sellers Rank") + td').text.split()[0]
Example
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
url = 'https://www.amazon.com/GymCope-Anti-Tear-Cushioning-Non-Slip-Exercise/dp/B0921F1T2P/ref=sr_1_3_sspa?brr=1&pd_rd_r=4b40f0a8-f2d8-44dc-9a98-413c64d3fa34&pd_rd_w=P9ZJI&pd_rd_wg=RS7zW&pf_rd_p=9875e817-188b-48a2-986d-8146749644ac&pf_rd_r=AGWBT5KT04TYKGPZASKA&qid=1642452438&rd=1&rnid=3407731&s=sporting-goods&sr=1-3-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUExVjhWTk0xQU5WWldPJmVuY3J5cHRlZElkPUEwODE0MzYwMTdMTDZSNDVST08yMiZlbmNyeXB0ZWRBZElkPUEwODQ4MDM0MlE4WEtVUjFKMUdLMiZ3aWRnZXROYW1lPXNwX2F0Zl9icm93c2UmYWN0aW9uPWNsaWNrUmVkaXJlY3QmZG9Ob3RMb2dDbGljaz10cnVl'
response = requests.get(url, headers=headers)
html = response.text
soup = BeautifulSoup(html)
soup.select_one('th:-soup-contains("Best Sellers Rank") + td').text.split()[0]
Output
#84,712
I am working on scraping the data from a website using BeautifulSoup. For whatever reason, I cannot seem to find a way to get the text between span elements to print. Here is what I am running.
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
url = 'https://www.amazon.com/GymCope-Anti-Tear-Cushioning-Non-Slip-Exercise/dp/B0921F1T2P/ref=sr_1_3_sspa?brr=1&pd_rd_r=4b40f0a8-f2d8-44dc-9a98-413c64d3fa34&pd_rd_w=P9ZJI&pd_rd_wg=RS7zW&pf_rd_p=9875e817-188b-48a2-986d-8146749644ac&pf_rd_r=AGWBT5KT04TYKGPZASKA&qid=1642452438&rd=1&rnid=3407731&s=sporting-goods&sr=1-3-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUExVjhWTk0xQU5WWldPJmVuY3J5cHRlZElkPUEwODE0MzYwMTdMTDZSNDVST08yMiZlbmNyeXB0ZWRBZElkPUEwODQ4MDM0MlE4WEtVUjFKMUdLMiZ3aWRnZXROYW1lPXNwX2F0Zl9icm93c2UmYWN0aW9uPWNsaWNrUmVkaXJlY3QmZG9Ob3RMb2dDbGljaz10cnVl'
response = requests.get(url, headers=headers)
html = response.text
soup = BeautifulSoup(html)
bsr = soup.find("div", class_="a-section table-padding").text
and seeing this,
>>> bsr
' ASIN B0921F1T2P Customer Reviews nn n 4.6 out of 5 stars n 41 ratings nnn 4.6 out of 5 stars Best Sellers Rank #69,660 in Sports & Outdoors (See Top 100 in Sports & Outdoors) #234 in Yoga Mats Date First Available April 8, 2021 '
I tried
bsra = soup.find("div", class_="a-section table-padding").find_next('span').get_text()
but it comes out
> > > bsr
> > > '\n 4.6 out of 5 stars '
I want only to scrape "Best Sellers Rank" as in the picture. Thanks.
Referenced picture in your question is missing, but you can get rank by selecting your elements more specific:
soup.select_one('th:-soup-contains("Best Sellers Rank") + td').text.split()[0]
Example
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
url = 'https://www.amazon.com/GymCope-Anti-Tear-Cushioning-Non-Slip-Exercise/dp/B0921F1T2P/ref=sr_1_3_sspa?brr=1&pd_rd_r=4b40f0a8-f2d8-44dc-9a98-413c64d3fa34&pd_rd_w=P9ZJI&pd_rd_wg=RS7zW&pf_rd_p=9875e817-188b-48a2-986d-8146749644ac&pf_rd_r=AGWBT5KT04TYKGPZASKA&qid=1642452438&rd=1&rnid=3407731&s=sporting-goods&sr=1-3-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUExVjhWTk0xQU5WWldPJmVuY3J5cHRlZElkPUEwODE0MzYwMTdMTDZSNDVST08yMiZlbmNyeXB0ZWRBZElkPUEwODQ4MDM0MlE4WEtVUjFKMUdLMiZ3aWRnZXROYW1lPXNwX2F0Zl9icm93c2UmYWN0aW9uPWNsaWNrUmVkaXJlY3QmZG9Ob3RMb2dDbGljaz10cnVl'
response = requests.get(url, headers=headers)
html = response.text
soup = BeautifulSoup(html)
soup.select_one('th:-soup-contains("Best Sellers Rank") + td').text.split()[0]
Output
#84,712