Trying to web scrape text from a table on a website
Question:
I am a novice at this. I’ve been trying to scrape data from a website (https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA), but I keep coming up empty. I’ve tried BeautifulSoup and Scrapy but I can’t get the text out.
Eventually I want to get the row of each individual wine in the table into a dataframe/csv (from all pages) but currently I can’t even get the first wine producer name.
If you inspect the webpage all the details are in tags with no id or class.
My BeautifulSoup attempt
URL = 'https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.52"}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")
soup2 = soup.prettify()
producer = soup2.find_all('td').get_text()
print(producer)
Which is throwing the error:
producer = soup2.find_all('td').get_text()
AttributeError: 'str' object has no attribute 'find_all'
My Scrapy attempt
winedf = pd.DataFrame()
class WineSpider(scrapy.Spider):
    name = 'wine_spider'
    def start_requests(self):
        dwwa_url = "https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA"
        yield scrapy.Request(url=dwwa_url, callback=self.parse_front)
    def parse_front(self, response):
        table = response.xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table')
        page_links = table.xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/div[2]/div[1]/ul/li[3]/a(@class, "dwwa-page-link") @href')
        links_to_follow = page_links.extract()
        for url in links_to_follow:
            yield response.follow(url=url, callback=self.parse_pages)
    def parse_pages(self, response):
        wine_name = Selector(response=response).xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table/tbody/tr[1]/td[1]/text()').get()
        wine_name_ext = wine_name.extract().strip()
        winedf.append(wine_name_ext)
        medal = Selector(response=response).xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table/tbody/tr[1]/td[4]/text()').get()
        medal_ext = medal.extract().strip()
        winedf.append(medal_ext)
Which produces an empty df.
Any help would be greatly appreciated.
Thank you!
Answers:
Try:
import pandas as pd
url = "https://decanterresultsapi.decanter.com/api/DWWA/2022/wines/search?competitionType=DWWA"
df = pd.read_json(url)
# print last items in df:
print(df.tail().to_markdown())
Prints:
| | producer | name | id | competition | award | score | country | region | subRegion | vintage | color | style | priceBandLetter | competitionYear | competitionType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14853 | Telavi Wine Cellar | Marani | 718257 | DWWA 2022 | 7 | 86 | Georgia | Kakheti | Kindzmarauli | 2021 | Red | Still – Medium (between 19 and 44 g/L residual sugar) | B | 2022 | DWWA |
| 14854 | Štrigova | Muškat Žuti | 716526 | DWWA 2022 | 7 | 87 | Croatia | Continental | Zagorje – Međimurje | 2021 | White | Still – Medium (between 19 and 44 g/L residual sugar) | C | 2022 | DWWA |
| 14855 | Kopjar | Muscat žUti | 717754 | DWWA 2022 | 7 | 86 | Croatia | Continental | Zagorje – Međimurje | 2021 | White | Still – Medium (between 19 and 44 g/L residual sugar) | C | 2022 | DWWA |
| 14856 | Cleebronn-Güglingen | Blanc De Noir Fein & Fruchtig | 719836 | DWWA 2022 | 7 | 87 | Germany | Württemberg | Not Applicable | 2021 | White | Still – Medium (between 19 and 44 g/L residual sugar) | B | 2022 | DWWA |
| 14857 | Winnice Czajkowski | Thoma 8 Grand Selection | 719891 | DWWA 2022 | 6 | 90 | Poland | Not Applicable | Not Applicable | 2021 | White | Still – Medium (between 19 and 44 g/L residual sugar) | D | 2022 | DWWA |
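As a side note on the AttributeError in the question: prettify() returns a plain string, so find_all has to be called on the soup object itself, and get_text() has to be called on each cell rather than on the whole result. A minimal sketch follows, using a made-up SAMPLE_HTML snippet and a hypothetical helper cell_texts (neither comes from the question). The real page appears to fill its table after load, so the same call would likely still return nothing there.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for a static page. This markup is an assumption for
# illustration only; it is not the HTML served by the Decanter site.
SAMPLE_HTML = """
<table>
  <tr><td>Yealands Estate Wines</td><td>Babydoll Sauvignon Blanc</td></tr>
  <tr><td>Telavi Wine Cellar</td><td>Marani</td></tr>
</table>
"""


def cell_texts(html):
    """Return the stripped text of every <td> element in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # Search the soup object, not soup.prettify(), which is just a str.
    # find_all returns a list-like ResultSet, so get_text() is applied
    # to each cell individually.
    return [td.get_text(strip=True) for td in soup.find_all("td")]


cells = cell_texts(SAMPLE_HTML)
print(cells)
```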
Before scraping a site, always inspect what it loads with your browser's network monitor. In this case you can see that the page loads its data dynamically from an API, which would explain why the raw HTML fetched by requests or Scrapy contains no table rows. This means that you can skip scraping altogether and load the data directly from the API into pandas:
import pandas as pd
df = pd.read_json('https://decanterresultsapi.decanter.com/api/DWWA/2022/wines/search?competitionType=DWWA')
Which gives all 14858 items (the first five rows are shown here):
| | producer | name | id | competition | award | score | country | region | subRegion | vintage | color | style | priceBandLetter | competitionYear | competitionType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Yealands Estate Wines | Babydoll Sauvignon Blanc | 706484 | DWWA 2022 | 7 | 88 | New Zealand | Marlborough | Not Applicable | 2021 | White | Still – Dry (below 5 g/L residual sugar) | A | 2022 | DWWA |
| 1 | Yealands Estate Wines | Reserve Pinot Gris | 706478 | DWWA 2022 | 7 | 86 | New Zealand | Marlborough | Not Applicable | 2021 | White | Still – Dry (below 5 g/L residual sugar) | B | 2022 | DWWA |
| 2 | Yealands Estate Wines | Babydoll Pinot Gris | 706479 | DWWA 2022 | 7 | 87 | New Zealand | Marlborough | Not Applicable | 2021 | White | Still – Dry (below 5 g/L residual sugar) | A | 2022 | DWWA |
| 3 | Yealands Estate Wines | Reserve Chardonnay | 705165 | DWWA 2022 | 6 | 90 | New Zealand | Hawke’s Bay | Not Applicable | 2021 | White | Still – Dry (below 5 g/L residual sugar) | B | 2022 | DWWA |
| 4 | Yealands Estate Wines | Reserve Sauvignon Blanc | 706486 | DWWA 2022 | 6 | 90 | New Zealand | Marlborough | Awatere Valley | 2021 | White | Still – Dry (below 5 g/L residual sugar) | B | 2022 | DWWA |
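Since the question asks for a dataframe/CSV at the end, here is a rough sketch of that last step. It assumes the API returns a JSON list of flat records (consistent with pd.read_json producing the table above, but not verified here). The helper names fetch_records and records_to_csv and the 30-second timeout are illustrative choices, not part of any library or of the site's API. The demonstration at the bottom runs offline on two rows copied from the output above.

```python
import io

import pandas as pd
import requests

# Endpoint taken from the answer above.
API_URL = (
    "https://decanterresultsapi.decanter.com/api/DWWA/2022/wines/search"
    "?competitionType=DWWA"
)


def fetch_records(url=API_URL, timeout=30):
    """Download the JSON payload (assumed to be a list of flat wine records)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on HTTP errors instead of parsing an error page
    return response.json()


def records_to_csv(records, target, columns=None):
    """Build a DataFrame from the records, optionally keep some columns, write CSV."""
    df = pd.DataFrame(records)
    if columns is not None:
        df = df[columns]
    df.to_csv(target, index=False)  # target may be a file path or a file-like object
    return df


# Offline demonstration with two rows copied from the output above.
sample = [
    {"producer": "Yealands Estate Wines", "name": "Babydoll Sauvignon Blanc",
     "score": 88, "country": "New Zealand"},
    {"producer": "Telavi Wine Cellar", "name": "Marani",
     "score": 86, "country": "Georgia"},
]
buffer = io.StringIO()
demo = records_to_csv(sample, buffer, columns=["producer", "name", "score"])
print(buffer.getvalue())
```

For the real data, something like records_to_csv(fetch_records(), "dwwa_2022.csv") should write the full result set to disk, provided the endpoint still responds as described.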