Scrapy: Increase robustness when scraping

Question:

I am trying my best to find a setting for my Scrapy spider to handle the following conditions:

  1. a power failure in the middle of my scraping activity, or
  2. my ISP going down.

The behavior I am expecting is that Scrapy should not give up. Rather, it should wait indefinitely for power/connectivity to be restored and continue scraping by retrying the requests after a brief pause or interval of 10 seconds.

This is the error message that I get when my internet goes off:

 https://example.com/1.html
 2022-10-21 17:44:14 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying
 <GET https://www.example.com/1.html> (failed 1 times): An error occurred while connecting: 10065: A socket operation was attempted to an unreachable host..

And the message repeats.

What I am afraid of is that when the blip is over, Scrapy will already have given up on 1.html and moved on to another URL such as 99.html.

My question is: when the "socket operation was attempted to an unreachable host" error occurs, how do I make Scrapy wait and retry the same URL, https://www.example.com/1.html?

Thanks in advance.

Asked By: asfand hikmat


Answers:

There is no built-in setting that will do this; however, it can still be implemented rather easily.

The way that seems the most straightforward to me would be to catch the response_received signal in your spider and check for the specific error code you receive when your ISP goes down. When this happens, you can pause the Scrapy engine, wait for any amount of time you want, and then retry the same request again until it succeeds.

For example:

import time

from scrapy import Spider
from scrapy.signals import response_received


class MySpider(Spider):
    ...
    ...

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        # listen for the response_received signal and call check_response
        crawler.signals.connect(spider.check_response, signal=response_received)
        return spider

    def check_response(self, response, request, spider):
        engine = spider.crawler.engine
        if response.status == 404:  # <- your error code goes here
            engine.pause()
            time.sleep(600)               # <- wait 10 minutes
            request.dont_filter = True    # <- tell the dupefilter not to drop the retried request
            engine.unpause()
            engine.crawl(request.copy())  # <- resend the request

Update

Since it isn't an HTTP error code that you are receiving, the next best solution would be to create a custom DownloaderMiddleware that catches the exceptions and then does pretty much the same thing as the first example.

In your middlewares.py file:

import time

from twisted.internet.error import (ConnectError, ConnectionDone,
                                    ConnectionLost, ConnectionRefusedError,
                                    DNSLookupError, TimeoutError)


class ConnectionLostPauseDownloadMiddleware:

    def __init__(self, settings, crawler):
        self.crawler = crawler
        # connection-level exceptions that should trigger a pause and retry
        self.exceptions = (ConnectionRefusedError, ConnectionDone, ConnectError, ConnectionLost)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings, crawler)

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.exceptions):
            new_request = request.copy()
            new_request.dont_filter = True   # don't let the dupefilter drop the retried request
            self.crawler.engine.pause()
            time.sleep(60 * 10)              # wait 10 minutes before retrying
            self.crawler.engine.unpause()
            return new_request

Then in your settings.py:

DOWNLOADER_MIDDLEWARES = {
    'MyProjectName.middlewares.ConnectionLostPauseDownloadMiddleware': 543,
}
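
As a possible refinement (a sketch, not something covered above), the pause length could be read from a custom setting instead of being hard-coded, which makes the 10-second interval from the question easy to configure. The setting name CONNECTION_PAUSE_SECONDS below is made up, not a built-in Scrapy setting:

# In the middleware's __init__ (settings is crawler.settings), read the pause
# length from a custom setting instead of hard-coding it:
#     self.pause_seconds = settings.getint('CONNECTION_PAUSE_SECONDS', 600)
# and call time.sleep(self.pause_seconds) in process_exception.

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'MyProjectName.middlewares.ConnectionLostPauseDownloadMiddleware': 543,
}
CONNECTION_PAUSE_SECONDS = 10  # hypothetical custom setting: pause 10 seconds before retrying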
Answered By: Alexander