Scrapy crawl http header data only
Question:
(How) can I get Scrapy to download only the header data of a website (for checking purposes, etc.)?
I've tried disabling some downloader middlewares, but it doesn't seem to work.
Answers:
Like @alexce said, you can issue HEAD Requests instead of the default GET:
Request(url, method="HEAD")
UPDATE: If you want to use HEAD requests for your start_urls, you will need to override the make_requests_from_url method:
def make_requests_from_url(self, url):
    return Request(url, method='HEAD', dont_filter=True)
UPDATE: make_requests_from_url was removed in Scrapy 2.6; on newer versions, override start_requests instead.