Python – Request being blocked by Cloudflare
Question:
I am trying to log into a website. When I look at print(g.text) I am not getting back the web page I expect; instead I get a Cloudflare page that says ‘Checking your browser before accessing’.
import requests
import time
s = requests.Session()
s.get('https://www.off---white.com/en/GB/')
headers = {'Referer': 'https://www.off---white.com/en/GB/login'}
payload = {
'utf8':'✓',
'authenticity_token':'',
'spree_user[email]': '[email protected]',
'spree_user[password]': 'PASSWORD',
'spree_user[remember_me]': '0',
'commit': 'Login'
}
r = s.post('https://www.off---white.com/en/GB/login', data=payload, headers=headers)
print(r.status_code)
g = s.get('https://www.off---white.com/en/GB/account')
print(g.status_code)
print(g.text)
Why is this occurring when I have set the session?
Answers:
This happens because the page uses Cloudflare’s anti-bot page (also known as "I’m Under Attack Mode", or IUAM).
Bypassing this check is difficult on your own, since Cloudflare changes their techniques periodically. Currently, they check whether the client supports JavaScript, which can be spoofed.
I would recommend using the cfscrape module for bypassing this.
To install it, use pip install cfscrape. You’ll also need to install Node.js.
You can pass a requests session into create_scraper() like so:
import requests
import cfscrape

session = requests.Session()
session.headers = ...  # set whatever custom headers you need
scraper = cfscrape.create_scraper(sess=session)  # reuses the session's cookies and headers
You might want to try this:
import cloudscraper

scraper = cloudscraper.create_scraper()  # returns a CloudScraper instance
# Or: scraper = cloudscraper.CloudScraper()  # CloudScraper inherits from requests.Session
print(scraper.get("http://somesite.com").text)  # => "<!DOCTYPE html><html><head>..."
It does not require a Node.js dependency.
All credit goes to the cloudscraper PyPI page.
I had the same problem because they implemented Cloudflare in the API. I solved it this way:
import cloudscraper
import json

scraper = cloudscraper.create_scraper()
r = scraper.get("MY API").text  # replace "MY API" with your API's URL
y = json.loads(r)
print(y)
curl and hx avoid this problem. But how?
I found that they work with HTTP/2 by default, while the requests library uses only HTTP/1.1.
So, for testing, I installed httpx with the h2 Python library (to support HTTP/2 requests), and it works if I do: httpx --http2 'https://some.url'.
So, the solution is to use a library that supports HTTP/2, for example httpx with h2.
It’s not a complete solution, since it won’t help to solve Cloudflare’s anti-bot ("I’m Under Attack Mode", or IUAM) challenge.
You can scrape Cloudflare-protected pages by using this tool. Node.js is mandatory for the code to work correctly.
Download Node.js from https://nodejs.org/en/
import cfscrape  # pip install cfscrape
scraper = cfscrape.create_scraper()
res = scraper.get("https://www.example.com").text
print(res)
You have run into a Cloudflare waiting room or "I'm Under Attack" page, used to check whether the request is made by a bot or a human. Web application firewalls (WAFs), like Cloudflare’s, use a variety of techniques to identify you as a bot. The client faces multiple challenges, like hCaptcha and others, which the Python requests module can’t solve.
There are two ways you can approach Cloudflare’s challenges:
- Complete the challenges and hCaptcha (hard way).
- Avoid the challenges altogether by imitating a real browser (easy way).
The simplest option is to try to be ignored by the WAF by imitating a real browser’s properties. You can use the Selenium library, which implements many techniques to avoid triggering WAFs.
You can use undetected-chromedriver in Selenium:
import undetected_chromedriver as uc
from webdriver_manager.chrome import ChromeDriverManager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
options = webdriver.ChromeOptions()
chrome_path = ChromeDriverManager().install()
chrome_service = Service(chrome_path)
driver = uc.Chrome(options=options, service=chrome_service, use_subprocess=True)
url = "https://www.off---white.com/en/GB/"
driver.get(url)
time.sleep(5)  # give the Cloudflare challenge time to clear
# Do something with the loaded page
driver.quit()
There are several factors to take into consideration. One of them is IP reputation and, in this case, geolocation. It looks like the site is more aggressive in blocking requests from some regions.
To avoid that, you could use high-quality proxies, preferably ones that let you choose the geolocation. I’d set it to GB to avoid further problems.
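As a sketch, routing a requests session through such a proxy looks like this; the proxy URL and credentials below are placeholders for whatever your provider gives you:

```python
import requests

# Hypothetical GB-geolocated proxy endpoint; substitute your provider's URL.
proxy_url = "http://user:pass@gb-proxy.example.com:8000"

session = requests.Session()
session.proxies.update({
    "http": proxy_url,
    "https": proxy_url,
})

# Every request on this session is now routed through the proxy:
# session.get("https://www.off---white.com/en/GB/")
```

Setting the proxies on the session (rather than per request) keeps the login POST and the follow-up account GET coming from the same exit IP, which matters for reputation checks.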
More information about how Cloudflare works.