Error status code 403 even with headers, Python Requests

Question:

I am sending a request to a URL. I copied the curl command and converted it with a curl-to-Python tool, so all the headers are included, but my request is not working: printing the status gives 403, and the HTML output contains Cloudflare error code 1020. The code is

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    # 'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
}

response = requests.get('https://v2.gcchmc.org/book-appointment/', headers=headers)

print(response.status_code)
print(response.cookies.get_dict())
with open("test.html",'w') as f:
    f.write(response.text)

I also get cookies, but I am not getting the desired response. I know I can do it with Selenium, but I want to understand the reason behind this. Thanks in advance.
Note:
I have installed all the libraries that requests depends on, with the same versions as on my computer, and it still fails with a 403 error.

Asked By: farhan jatt


Answers:

It works on my machine, so I am not sure what the problem is.

However, when a request does not work, I often check whether it succeeds using Playwright. Playwright drives a real browser and thus mimics your actual browser when visiting the page. It can be installed with pip install playwright. The first time you run it, it may raise an error telling you to install the browser drivers; just follow the instructions to do so.

With Playwright you can try the following:

from playwright.sync_api import sync_playwright


url = 'https://v2.gcchmc.org/book-appointment/'
ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/69.0.3497.100 Safari/537.36"
)

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page(user_agent=ua)
    page.goto(url)
    page.wait_for_timeout(1000)

    html = page.content()
    browser.close()

print(html)

A downside of Playwright is that it requires installing the Chromium (or another) browser. This may complicate deployment, since the browser cannot simply be added to requirements.txt, and a container image is typically required.

Answered By: Jeroen Vermunt

The site is protected by Cloudflare, which aims to block, among other things, unauthorized data scraping. From What is data scraping?

The process of web scraping is fairly simple, though the
implementation can be complex. Web scraping occurs in 3 steps:

  1. First the piece of code used to pull the information, which we call a scraper bot, sends an HTTP GET request to a specific website.
  2. When the website responds, the scraper parses the HTML document for a specific pattern of data.
  3. Once the data is extracted, it is converted into whatever specific format the scraper bot’s author designed.
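The three steps above can be sketched in Python. This is an illustrative sketch only: the HTML snippet, the `clinic` class name, and the JSON output format are made up for the example, and in a real scraper step 1 would be an actual `requests.get(url).text` call.

```python
import json
import re

# Step 1 would normally be an HTTP GET request, e.g. requests.get(url).text.
# A hard-coded snippet is used here so the example is self-contained.
html = """
<ul>
  <li class="clinic">Cardiology</li>
  <li class="clinic">Radiology</li>
</ul>
"""

# Step 2: parse the HTML document for a specific pattern of data.
clinics = re.findall(r'<li class="clinic">(.*?)</li>', html)

# Step 3: convert the extracted data into the author's chosen format.
output = json.dumps({"clinics": clinics})
print(output)  # {"clinics": ["Cardiology", "Radiology"]}
```

Cloudflare's protection targets step 1: the GET request itself is fingerprinted and blocked before any parsing can happen, which is why the 403 appears regardless of the headers sent.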

You can use urllib instead of requests; it seems to be able to deal with Cloudflare here:

import urllib.request

req = urllib.request.Request('https://v2.gcchmc.org/book-appointment/')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8')
req.add_header('Accept-Language', 'en-US,en;q=0.5')

r = urllib.request.urlopen(req).read().decode('utf-8')
with open("test.html", 'w', encoding="utf-8") as f:
    f.write(r)
Answered By: Guy

Try running Burp Suite's Proxy to see all the headers and other data such as cookies. Then you can mimic the request with the Python requests module. That's what I always do.
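To compare your script's traffic against the browser's inside Burp, you can route requests through Burp's intercepting proxy. This is a minimal sketch assuming Burp's default listener on 127.0.0.1:8080; adjust the address to your setup.

```python
import requests

# Burp's default proxy listener; both schemes are routed through it.
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}

# verify=False is needed unless Burp's CA certificate is installed,
# because Burp re-signs the TLS traffic it intercepts.
# response = requests.get("https://v2.gcchmc.org/book-appointment/",
#                         proxies=proxies, verify=False)
```

With both the browser's request and the script's request visible in Burp's Proxy history, you can diff them header by header to see what the site's protection keys on.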

Good luck!

Answered By: hexa

Had the same problem recently.

Using the JavaScript fetch API with Selenium-Profiles worked for me.

Example JS:

fetch('http://example.com/movies.json')
  .then((response) => response.json())
  .then((data) => console.log(data));

Example Python with Selenium-Profiles:

headers = {
    "accept": "application/json",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": profile["cdp"]["useragent"]["acceptLanguage"],
    "content-type": "application/json",
    # "cookie": cookie_str,  # optional
    "sec-ch-ua": "'Google Chrome';v='107', 'Chromium';v='107', 'Not=A?Brand';v='24'",
    "sec-ch-ua-mobile": "?0",  # "?1" for mobile
    "sec-ch-ua-platform": "'" + profile['cdp']['useragent']['userAgentMetadata']['platform'] + "'",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "user-agent": profile['cdp']['useragent']['userAgent'],
}

answer = driver.requests.fetch(
    "https://www.example.com/",
    options={
        "body": json.dumps(post_data),
        "headers": headers,
        "method": "POST",
        "mode": "same-origin",
    },
)

I don't know why this occurs, but I assume Cloudflare and others are able to detect whether a request is made with JavaScript.

Answered By: kaliiiiiiiii