Scrapy crawl HTTP header data only

Question:

(How) can I make Scrapy download only the header data of a website (for checking purposes, etc.)?

I've tried disabling some downloader middlewares, but that doesn't seem to work.

Asked By: Niklas Hantke


Answers:

As @alexce said, you can issue HEAD requests instead of the default GET:

Request(url, method="HEAD")

UPDATE: If you want to use HEAD requests for your start_urls, you will need to override the make_requests_from_url method:

def make_requests_from_url(self, url):
    return Request(url, method='HEAD', dont_filter=True)

UPDATE: make_requests_from_url was removed in Scrapy 2.6.

Answered By: Steven Almeroth