Web scraping reviews from Amazon only returns data for the first page
Question:
I am trying to scrape reviews from Amazon. The reviews can appear on multiple pages; to scrape more than one page, I construct a list of links which I later scrape separately:
# Construct list of links to scrape multiple pages
links = []
for x in range(1, 5):
    links.append(f'https://www.amazon.de/-/en/SanDisk-microSDHC-memory-adapter-performance/product-reviews/B08GY9NYRM/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber={x}')
I then use requests and Beautiful Soup to obtain the raw review data, as below:
# Scrape all links in the constructed list
import requests
from bs4 import BeautifulSoup

reviews = []
for link in links:
    html = requests.get(link, headers=HEADERS)  # HEADERS is defined elsewhere
    if html.status_code == 200:
        # HTML response was successful
        soup = BeautifulSoup(html.text, 'html.parser')
        results = soup.find_all('span', {'data-hook': 'review-body'})
        print(len(results))
        for review in results:
            reviews.append(review.text.replace('\n', ''))
    else:
        # HTML response was unsuccessful
        print('[BAD HTML RESPONSE] Response Code =', html.status_code)
Each page contains 10 reviews, and I receive all 10 reviews for the first page (&pageNumber=1), but for each following page I do not receive any information.
When checking the corresponding soup objects I can't find the review information. Why is this?
I tried scraping only page 2 outside of the for loop, but no review information is returned.
Two months ago the same code worked on over 80 pages. I do not understand why it is not working now (has Amazon changed something?). Thanks for your time and help!
Answers:
The reason soup doesn't contain any review information is that Amazon is returning a page with a CAPTCHA rather than the actual page with the product reviews.
You can verify this by dumping the returned HTML into a file and opening it in your browser:
with open("example.html", "w") as f:
    f.write(str(soup))
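You can also detect this case inside the scraping loop instead of inspecting files by hand. A minimal sketch, assuming the CAPTCHA page contains Amazon's usual robot-check wording (the marker strings below are assumptions, not guaranteed to be stable):

```python
# Sketch: heuristically detect Amazon's CAPTCHA interstitial in fetched HTML.
# The marker strings are assumptions based on the typical robot-check page.
def looks_like_captcha(html_text: str) -> bool:
    markers = (
        "api-services-support@amazon.com",
        "Type the characters you see in this image",
    )
    return any(marker in html_text for marker in markers)
```

You could call `looks_like_captcha(html.text)` before parsing with BeautifulSoup and skip or retry the page when it returns True.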
I happened to come across the exact same problem as you. I did a bit of research, and it turns out you need to send proper headers (not just the user agent). I'm not sure which headers you used, but this works for me:
Go to http://httpbin.org/get
Copy everything under "headers", but remove "Host", and paste it as your header!
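The copy-and-filter step above can be sketched in code. Note the point of the tip is to open http://httpbin.org/get in your browser and copy the headers your browser sends, so the echoed JSON below is shown as a sample literal rather than fetched from Python (which would only echo requests' own headers):

```python
# Sketch: given the JSON echoed by http://httpbin.org/get (sample values
# here are illustrative), keep everything under "headers" except "Host".
echoed = {
    "headers": {
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
        "Host": "httpbin.org",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    }
}
HEADERS = {k: v for k, v in echoed["headers"].items() if k != "Host"}
```

The resulting dict can then be passed as `headers=HEADERS` to `requests.get`.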
Hopefully, this works for you!
You can solve this problem by sending the correct headers.
header = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9",
"Upgrade-Insecure-Requests": "1",
"Referer": "https://www.google.com/",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}
Try using this header.
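For example, the header dict above can be attached to a request like this (a sketch: normally you would just call `requests.get(link, headers=header)`; a prepared request is used here only to show the headers end up on the outgoing request without hitting the network, and the URL is the review URL from the question):

```python
import requests

header = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Upgrade-Insecure-Requests": "1",
    "Referer": "https://www.google.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
}

url = ("https://www.amazon.de/-/en/SanDisk-microSDHC-memory-adapter-performance/"
       "product-reviews/B08GY9NYRM/?pageNumber=2")
# Prepare (but do not send) the request to inspect its outgoing headers.
prepared = requests.Request("GET", url, headers=header).prepare()
print(prepared.headers["User-Agent"])
```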
Read this blog if you want to know more about headers:
https://www.zenrows.com/blog/web-scraping-headers#what-are-http-headers