How to add Headers to Scrapy CrawlSpider Requests?

Question:

I’m working with the CrawlSpider class to crawl a website and I would like to modify the headers that are sent in each request. Specifically, I would like to add the referer to the request.

As per this question, I checked

response.request.headers.get('Referer', None)

in my response parsing function and the Referer header is not present. I assume that means the Referer is not being submitted in the request (unless the website doesn’t return it, I’m not sure on that).

I haven’t been able to figure out how to modify the headers of a request. Again, my spider is derived from CrawlSpider. Overriding CrawlSpider’s _requests_to_follow or specifying a process_request callback for a rule will not work because the referer is not in scope at those times.

Does anyone know how to modify request headers dynamically?

Asked By: CatShoes

Answers:

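You have to enable the spider middleware that populates the Referer header on outgoing requests, using the URL of the response that generated them. See the documentation for scrapy.contrib.spidermiddleware.referer.RefererMiddleware (in Scrapy 1.0 and later the module path is scrapy.spidermiddlewares.referer.RefererMiddleware).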

In short, you need to add this middleware to your project’s settings file.

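SPIDER_MIDDLEWARES = {
    # The value is the middleware's order (an integer), not a boolean.
    # 700 is the order Scrapy itself assigns to this middleware by default.
    'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
}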

Then, in your response parsing method, you can use response.request.headers.get('Referer', None) to get the referer. Note the header's spelling: 'Referer', with a single "r" in the middle.

Answered By: CatShoes

You can pass the Referer manually to each request using the headers argument:

yield Request(url=..., callback=..., headers={'Referer': ...})

RefererMiddleware does the same thing automatically, taking the referer URL from the response that produced the request.

Answered By: warvariuc