Problem scraping Amazon using requests: I get blocked even when using cookies and headers. I can only scrape using a browser. Any solution?

Question:

The requests module isn’t working for me anymore when I try to scrape Amazon. I’ve tried using cookies, headers, and changing IPs, but nothing really works other than scraping through a browser. Does anyone know how they’re able to detect this, and whether there’s a good workaround using requests?

The really odd thing is that the request returns the page when sent through cURL, but when I convert it to Python code it returns a captcha request that I can’t see in my browser and that doesn’t go away even with cookies.

For example, this cURL request returns the Amazon main page, but when turned into Python it returns a captcha request:

curl -L -vvv http://amazon.com -H "User-Agent:Mozilla 5.0"

This is my current code. I copied the request from the browser as cURL and converted it to Python, but it still doesn’t work:

import requests

cookies = {
    'session-id': '135-4585428-6195300',
    'session-id-time': '2082787201l',
    'i18n-prefs': 'USD',
    'sp-cdn': '"L5Z9:IL"',
    'ubid-main': '132-1503580-7678418',
    'session-token': 'R5XVE3t8VeX8bRwnjuxXwONDgBnxkngfLfzobFxK5HL+8QaofrVEPjv8Mvta3D6EMlaiFeOyhjjiHkHLjjRwlh9seQ0wsfXE0BU0csh2Wtx6q6r630bsx5VvbBIQcyVAPRkgvL5wgU12P39t5iCZ7b3ykFjRvb9qe7eScZC/F9DJ+NuFMOVP+Z7OQtlZNQzcYrKmWTJH0HJZho8VtJBish0ATwfLhVI+Ihu1ioHYUfSUNDdjQFgG7SyiKZDufkXekZZGaF3x24vY9haBeJVnE9GjmMN+XHySuQtP/stlZmhlp9JOH17+JTZHVsCn/SEONdK5QhETXzoaQ+9YvptxA+v49bgXJn+L',
    'csm-hit': 'tb:NBK78382HSSRXD9W22YX+s-SKXXAE4EMPQ2XYNGK1G0|1692968547644&t:1692968547644&adb:adblk_no',
}

headers = {
    'authority': 'www.amazon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    # 'cookie': 'session-id=135-4585428-6195300; session-id-time=2082787201l; i18n-prefs=USD; sp-cdn="L5Z9:IL"; ubid-main=132-1503580-7678418; session-token=R5XVE3t8VeX8bRwnjuxXwONDgBnxkngfLfzobFxK5HL+8QaofrVEPjv8Mvta3D6EMlaiFeOyhjjiHkHLjjRwlh9seQ0wsfXE0BU0csh2Wtx6q6r630bsx5VvbBIQcyVAPRkgvL5wgU12P39t5iCZ7b3ykFjRvb9qe7eScZC/F9DJ+NuFMOVP+Z7OQtlZNQzcYrKmWTJH0HJZho8VtJBish0ATwfLhVI+Ihu1ioHYUfSUNDdjQFgG7SyiKZDufkXekZZGaF3x24vY9haBeJVnE9GjmMN+XHySuQtP/stlZmhlp9JOH17+JTZHVsCn/SEONdK5QhETXzoaQ+9YvptxA+v49bgXJn+L; csm-hit=tb:NBK78382HSSRXD9W22YX+s-SKXXAE4EMPQ2XYNGK1G0|1692968547644&t:1692968547644&adb:adblk_no',
    'device-memory': '8',
    'downlink': '10',
    'dpr': '1',
    'ect': '4g',
    'rtt': '100',
    'sec-ch-device-memory': '8',
    'sec-ch-dpr': '1',
    'sec-ch-ua': '"Chromium";v="116", "Not)A;Brand";v="24", "Microsoft Edge";v="116"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-ch-ua-platform-version': '"10.0.0"',
    'sec-ch-viewport-width': '1037',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 Edg/116.0.1938.54',
    'viewport-width': '1037',
}

response = requests.get('https://www.amazon.com/dp/B002G9UDYG', cookies=cookies, headers=headers)
Asked By: MarinaF

Answers:

I don’t think you can scrape Amazon with Python Requests, even if you use information extracted from a valid browser session.

Basic curl connection

curl -I http://www.amazon.com

The response below shows that the URL is served through Amazon CloudFront and returns status code 301, which tells us that the URL is being permanently redirected to another URL (here the HTTPS version of the site, as given in the Location header):

HTTP/1.1 301 Moved Permanently
Server: CloudFront
Date: Wed, 30 Aug 2023 12:41:38 GMT
Content-Type: text/html
Content-Length: 167
Connection: keep-alive
Location: https://www.amazon.com/
X-Cache: Redirect from cloudfront
Via: 1.1 322b7a8ce3aa88236c8ca9410d0b9300.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: ATL58-P3
Alt-Svc: h3=":443"; ma=86400
X-Amz-Cf-Id: oK3dFCUCiQ6ZdAe_BEC5p-XbRxcrXFiYupSaYQOh6W1JS85BJsLrKA==
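
The same reading of the status line and Location header can be sketched in Python. The helper below is hypothetical (summarize_redirect is not part of Requests or any other library); it only interprets a status code and a header mapping. With Requests, those inputs would presumably come from a call such as requests.head('http://www.amazon.com', allow_redirects=False), which returns the first response instead of following the redirect.

```python
# Hypothetical helper: describe a redirect from a status code and headers.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def summarize_redirect(status_code, headers):
    """Return a one-line description of the response's redirect, if any."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    location = lowered.get("location")
    if status_code in REDIRECT_CODES and location:
        kind = "permanent" if status_code in (301, 308) else "temporary"
        return f"{status_code}: {kind} redirect to {location}"
    return f"{status_code}: no redirect"


# Values copied from the curl output above.
sample_headers = {"Server": "CloudFront", "Location": "https://www.amazon.com/"}
print(summarize_redirect(301, sample_headers))
```

Keeping the helper free of network calls makes it easy to check against saved responses before pointing it at a live site.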

Python Requests

import requests
response = requests.get('https://www.amazon.com/')

print(response.status_code)
503 

print(response.headers)
{'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Server': 'Server', 'Date': 'Wed, 30 Aug 2023 12:52:08 GMT', 'x-amz-rid': 'YXG2PT1GB1T7GY4Q19KC', 'Vary': 'Content-Type,Accept-Encoding,User-Agent', 'Last-Modified': 'Mon, 12 Jun 2023 22:17:25 GMT', 'ETag': '"a6f-5fdf615518740-gzip"', 'Accept-Ranges': 'bytes', 'Content-Encoding': 'gzip', 'Strict-Transport-Security': 'max-age=47474747; includeSubDomains; preload', 'X-Cache': 'Error from cloudfront', 'Via': '1.1 71cf657de17d1d4de9dbcb4ff38d54c0.cloudfront.net (CloudFront)', 'X-Amz-Cf-Pop': 'ATL56-P1', 'Alt-Svc': 'h3=":443"; ma=86400', 'X-Amz-Cf-Id': 'Rxe_ROuUee2QLLxW7e8tVqbJ4WwRK3JXbhjxrgV-WXwrb0q6pdzdbg=='}

The status code 503 indicates that the server is temporarily unable to handle the request. The X-Cache header reads 'Error from cloudfront', which suggests that Amazon CloudFront is not allowing the connection.

If we examine the content of the page (response.text), we see this:

To discuss automated access to Amazon data please contact api-services-support@amazon.com. For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv
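
A scraper can check for this kind of response before trying to parse it. The following is a minimal sketch, not a documented Amazon behaviour: looks_blocked and BLOCK_MARKERS are hypothetical names, the first marker is the phrase from the page quoted above, and treating any mention of "captcha" as a block is an assumption based on the captcha page described in the question.

```python
# Assumed markers: the first phrase comes from the block page quoted above;
# matching on "captcha" is a guess based on the question, not documented behaviour.
BLOCK_MARKERS = (
    "to discuss automated access to amazon data",
    "captcha",
)


def looks_blocked(status_code, body):
    """Return True if the response looks like an anti-bot page rather than content."""
    # A 503 is what the plain requests.get() call above received.
    if status_code == 503:
        return True
    text = body.lower()
    return any(marker in text for marker in BLOCK_MARKERS)


blocked_body = "To discuss automated access to Amazon data please contact ..."
print(looks_blocked(503, blocked_body))
print(looks_blocked(200, "<html><title>Product page</title></html>"))
```

With Requests, this would be called as looks_blocked(response.status_code, response.text) right after the request.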

Based on this information, Amazon is trying to prevent people from scraping its site with tools such as Python Requests. I would recommend trying Selenium or Amazon’s API.
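
As a rough illustration of the Selenium route, the sketch below loads a page in a real Chrome browser and returns the rendered HTML. It assumes Selenium 4 and a local Chrome installation; fetch_page_source is a hypothetical name, and nothing here guarantees that Amazon will serve the page rather than a captcha.

```python
def fetch_page_source(url, headless=True):
    """Load url in Chrome via Selenium and return the rendered HTML."""
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    if headless:
        # Assumption: a recent Chrome that supports the "new" headless mode.
        options.add_argument("--headless=new")

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

A call such as fetch_page_source('https://www.amazon.com/dp/B002G9UDYG') would then return the HTML for the product page from the question, which can be passed to an HTML parser.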

Here are some sites that highlight how to use selenium to scrape Amazon:

Answered By: Life is complex

Here’s the Python code. It’s pretty simple:

import requests

# Replace the default python-requests User-Agent with an empty value
headers = {
    'User-Agent': '',
}

response = requests.get(url='https://amazon.com', headers=headers)
print(response)

Answered By: user23568537