Web scraping reviews from Amazon only returns data for the first page

Question:

I am trying to scrape reviews from Amazon. The reviews can span multiple pages, so to scrape more than one page I construct a list of links which I later scrape separately:

# Construct list of links to scrape multiple pages
links = []
for x in range(1,5):
    links.append(f'https://www.amazon.de/-/en/SanDisk-microSDHC-memory-adapter-performance/product-reviews/B08GY9NYRM/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber={x}')

I then use requests and beautiful soup to obtain the raw review data as below:

import requests
from bs4 import BeautifulSoup

# Scrape all links in the constructed list
# (HEADERS is a dict of request headers defined earlier in my script)
reviews = []
for link in links:
    html = requests.get(link, headers=HEADERS)
    if html.status_code == 200:
        # HTML response was successful
        soup = BeautifulSoup(html.text, 'html.parser')
        results = soup.find_all('span', {'data-hook': 'review-body'})
        print(len(results))
        for review in results:
            reviews.append(review.text.replace('\n', ''))
    else:
        # HTML response was unsuccessful
        print('[BAD HTML RESPONSE] Response Code =', html.status_code)

Each page contains 10 reviews, and I receive all 10 reviews for the first page (&pageNumber=1), but for each following page I do not receive any information.

[Output of the above code]

When I check the corresponding soup objects, I can't find the review information. Why is this?

I tried scraping only page 2 outside of the for loop, but no review information is returned either.

Two months ago the same code worked on over 80 pages. I do not understand why it is not working now (has Amazon changed something?). Thanks for your time and help!

Asked By: Leon Ranke


Answers:

The reason soup doesn't contain any review information is that Amazon is returning a page with a CAPTCHA rather than the actual page with the product reviews.
You can verify this by dumping the returned HTML into a file and opening it in your browser:

with open("example.html", "w", encoding="utf-8") as f:
    f.write(str(soup))
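
If you would rather detect this in code than by eye, a quick heuristic is to scan the response body inside the question's loop. This is only a sketch: the "captcha" marker text is an assumption based on what Amazon's interstitial page usually contains, not a documented contract.

if "captcha" in html.text.lower():
    # Amazon served its CAPTCHA interstitial instead of the review page
    print("[CAPTCHA] Request was blocked for:", link)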
Answered By: user21600943

I happened to come across the exact same problem as you. I did a bit of research, and it turns out you need to send proper headers (not just the User-Agent). I'm not sure what headers you used, but this works for me:

1. Go to http://httpbin.org/get
2. Copy everything under "headers", but remove "Host", and paste it as your headers (see the sketch below)!
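
For illustration, the copied headers might end up as a dict like the one below. The values here are placeholders, not the ones you should use; take whatever httpbin shows for your own browser, then pass the dict on every request:

import requests

HEADERS = {
    # Example values as shown by http://httpbin.org/get in a browser,
    # with "Host" removed as described above; yours will differ
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
}

html = requests.get(link, headers=HEADERS)  # link as constructed in the question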

Hopefully, this works for you!

Answered By: renazxcv

You can solve this problem by sending the correct headers.

header = { 
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Accept-Language": "en-US,en;q=0.9", 
    "Upgrade-Insecure-Requests": "1", 
    "Referer": "https://www.google.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}

Try using this header.
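
As a minimal sketch (reusing the links list from the question), this just means passing the dict above as the headers argument:

import requests

for link in links:
    html = requests.get(link, headers=header)
    print(link, html.status_code)
    # Caveat: advertising "br" in Accept-Encoding assumes the brotli
    # package is installed so requests can decode the body; drop "br"
    # from the header if it is not.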

Read this blog if you want to know more about headers:
https://www.zenrows.com/blog/web-scraping-headers#what-are-http-headers

Answered By: Prudhvi