How to bypass Cloudflare bot/DDoS protection in Scrapy?

Question:

I used to scrape an e-commerce webpage occasionally to get product price information. I had not used the Scrapy-based scraper in a while, and when I tried it yesterday I ran into a problem with bot protection.

It is using Cloudflare's DDoS protection, which basically uses JavaScript evaluation to filter out browsers (and therefore scrapers) with JS disabled. Once the function is evaluated, a response with the calculated number is generated. In return, the service sends back two authentication cookies which, attached to each request, allow the site to be crawled normally. Here's the description of how it works.
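To make the cookie mechanism concrete, here is a minimal sketch with made-up values (the cookie names Cloudflare used at the time were "__cfduid" and "cf_clearance"):

```python
# A sketch with made-up values: once the JS challenge is solved, Cloudflare
# sets two cookies (historically "__cfduid" and "cf_clearance") that must
# accompany every subsequent request.
tokens = {"__cfduid": "d1a2b3c4", "cf_clearance": "x9y8z7w6"}

# On the wire they travel as a single Cookie header:
cookie_header = "; ".join(f"{name}={value}" for name, value in tokens.items())
print(cookie_header)  # __cfduid=d1a2b3c4; cf_clearance=x9y8z7w6
```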

I have also found the cloudflare-scrape Python module, which uses an external JS evaluation engine to calculate the number and send the request back to the server. I'm not sure how to integrate it into Scrapy, though. Or maybe there's a smarter way that doesn't require JS execution? In the end, it's just a form…

I'd appreciate any help.

Asked By: Kulbi


Answers:

Obviously the best way to do this would be to whitelist your IP in Cloudflare; if that isn't an option, let me recommend the cloudflare-scrape library. You can use it to get the cookie token, then provide that token in your Scrapy request back to the server.

Answered By: mjsa

So I executed the JavaScript using Python with the help of cloudflare-scrape.

You need to add the following code to your scraper:

# at the top of your spider module:
import cfscrape
from scrapy import Request

def start_requests(self):
    for url in self.start_urls:
        # get_tokens solves the JS challenge and returns the clearance cookies
        # together with the user agent they were issued for; the two must be
        # sent together, so pass both on to the Request
        token, agent = cfscrape.get_tokens(url, 'Your preferable user agent, _optional_')
        yield Request(url=url, cookies=token, headers={'User-Agent': agent})

alongside your parsing functions. And that's it!

Of course, you need to install cloudflare-scrape first (pip install cfscrape) and import it in your spider. You also need a JS execution engine installed; I already had Node.js, and it worked without complaints.

Answered By: Kulbi

If you're getting a 503 error, you can follow these guidelines:

  1. Go to settings.py
  2. Search for: USER_AGENT
  3. Here you will see the default Scrapy bot user agent.
    Replace that default with this:

    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
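A per-spider alternative (a sketch; the spider name and class here are made up, only the custom_settings mechanism matters) is to override custom_settings instead of editing settings.py project-wide:

```python
import scrapy

class ProductSpider(scrapy.Spider):
    # Hypothetical spider; shown only to illustrate the override mechanism
    name = "products"
    # Overrides the project-wide USER_AGENT for this spider only
    custom_settings = {
        "USER_AGENT": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/83.0.4103.116 Safari/537.36"
        ),
    }
```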
    
Answered By: Shahzaib Chadhar

Since cloudflare-scrape hasn’t been maintained for a while, anyone having problems can switch to cloudscraper instead:

pip install cloudscraper

Then you can initialize the scraper inside of your class as:

self.scraper = cloudscraper.create_scraper()

You can also pass a requests.Session object to automatically pick up its headers and cookies. It's also possible to use a custom User-Agent, and you can even specify a device or browser:

# Only use desktop Firefox user agents on Windows
self.scraper = cloudscraper.create_scraper(
    browser={"browser": "firefox", "platform": "windows", "desktop": True, "mobile": False}
)
# "custom" will also try to find the user-agent string in the pre-defined database.
# If a match is found, it will use the headers and cipherSuite from that browser;
# otherwise, a generic set of headers and cipherSuite will be used.
scraper = cloudscraper.create_scraper(
    browser={
        'custom': useragent  # a user-agent string you define yourself
    }
)

After that, the script the OP presented will work with a little tweak:

def start_requests(self):
    for url in self.start_urls:
        # cloudscraper exposes get_tokens at module level, mirroring cfscrape
        token, agent = cloudscraper.get_tokens(url)
        yield Request(url=url, cookies=token, headers={'User-Agent': agent})

Of course, you can still get blocked by Cloudflare’s WAF. You can check this out for more detailed information and bypassing techniques: bypassing Cloudflare.

Answered By: AnderRV