How to get the scrapy failure URLs?

Question:

I’m a newbie to Scrapy, and it’s the most amazing crawler framework I have known!

In my project, I sent more than 90,000 requests, but some of them failed.
I set the log level to INFO, and I can only see some statistics, but no details.

2012-12-05 21:03:04+0800 [pd_spider] INFO: Dumping spider stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/twisted.internet.error.ConnectionDone': 1,
 'downloader/request_bytes': 46282582,
 'downloader/request_count': 92383,
 'downloader/request_method_count/GET': 92383,
 'downloader/response_bytes': 123766459,
 'downloader/response_count': 92382,
 'downloader/response_status_count/200': 92382,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2012, 12, 5, 13, 3, 4, 836000),
 'item_scraped_count': 46191,
 'request_depth_max': 1,
 'scheduler/memory_enqueued': 92383,
 'start_time': datetime.datetime(2012, 12, 5, 12, 23, 25, 427000)}

Is there any way to get a more detailed report? For example, showing those failed URLs. Thanks!

Asked By: Joe Wu


Answers:

Yes, this is possible.

  • The code below adds a failed_urls list to a basic spider class and appends urls to it if the response status of the url is 404 (this would need to be extended to cover other error statuses as required).
  • Next I added a handler that joins the list into a single string and adds it to the spider’s stats when the spider is closed.
  • Based on your comments, it’s possible to track Twisted errors, and some of the answers below give examples of how to handle that particular use case.
  • The code has been updated to work with Scrapy 1.8. All credit for this should go to Juliano Mendieta, since all I did was add his suggested edits and confirm that the spider works as intended.

from scrapy import Spider, signals

class MySpider(Spider):
    handle_httpstatus_list = [404] 
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://www.example.com/thisurlexists.html',
        'http://www.example.com/thisurldoesnotexist.html',
        'http://www.example.com/neitherdoesthisone.html'
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.failed_urls = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.handle_spider_closed, signals.spider_closed)
        return spider

    def parse(self, response):
        if response.status == 404:
            self.crawler.stats.inc_value('failed_url_count')
            self.failed_urls.append(response.url)

    def handle_spider_closed(self, reason):
        self.crawler.stats.set_value('failed_urls', ', '.join(self.failed_urls))

    def process_exception(self, response, exception, spider):
        ex_class = "%s.%s" % (exception.__class__.__module__, exception.__class__.__name__)
        self.crawler.stats.inc_value('downloader/exception_count', spider=spider)
        self.crawler.stats.inc_value('downloader/exception_type_count/%s' % ex_class, spider=spider)

Example output (note that the downloader/exception_count* stats will only appear if exceptions are actually thrown – I simulated them by trying to run the spider after I’d turned off my wireless adapter):

2012-12-10 11:15:26+0000 [myspider] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 15,
     'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 15,
     'downloader/request_bytes': 717,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'downloader/response_bytes': 15209,
     'downloader/response_count': 3,
     'downloader/response_status_count/200': 1,
     'downloader/response_status_count/404': 2,
     'failed_url_count': 2,
     'failed_urls': 'http://www.example.com/thisurldoesnotexist.html, http://www.example.com/neitherdoesthisone.html',
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 12, 10, 11, 15, 26, 874000),
     'log_count/DEBUG': 9,
     'log_count/ERROR': 2,
     'log_count/INFO': 4,
     'response_received_count': 3,
     'scheduler/dequeued': 3,
     'scheduler/dequeued/memory': 3,
     'scheduler/enqueued': 3,
     'scheduler/enqueued/memory': 3,
     'spider_exceptions/NameError': 2,
     'start_time': datetime.datetime(2012, 12, 10, 11, 15, 26, 560000)}
Answered By: Talvalin

Here’s another example of how to handle and collect 404 errors (checking the GitHub help pages):

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item, Field


class GitHubLinkItem(Item):
    url = Field()
    referer = Field()
    status = Field()


class GithubHelpSpider(CrawlSpider):
    name = "github_help"
    allowed_domains = ["help.github.com"]
    start_urls = ["https://help.github.com", ]
    handle_httpstatus_list = [404]
    rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        if response.status == 404:
            item = GitHubLinkItem()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['status'] = response.status

            return item

Just run scrapy runspider with -o output.json and see the list of items in the output.json file.
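
Note that the scrapy.contrib imports and SgmlLinkExtractor used above were removed in later Scrapy releases; on a current version the equivalent imports would, as far as I can tell, be:

from scrapy import Item, Field
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# the rule then becomes:
# rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

The rest of the spider can stay the same.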

Answered By: alecxe

The answers from @Talvalin and @alecxe helped me a great deal, but they do not seem to capture downloader events that do not generate a response object (for instance, twisted.internet.error.TimeoutError and twisted.web.http.PotentialDataLoss). These errors show up in the stats dump at the end of the run, but without any meta info.

As I found out here, the errors are tracked by the stats.py middleware, captured in the DownloaderStats class’ process_exception method, and specifically in the ex_class variable, which increments each error type as necessary, and then dumps the counts at the end of the run.

To match such errors with information from the corresponding request object, you can add a unique id to each request (via request.meta), then pull it into the process_exception method of stats.py:

self.stats.set_value('downloader/my_errs/{0}'.format(request.meta), ex_class)

That will generate a unique string for each downloader-based error not accompanied by a response. You can then save the altered stats.py as something else (e.g. my_stats.py), add it to the downloadermiddlewares (with the right precedence), and disable the stock stats.py:

DOWNLOADER_MIDDLEWARES = {
    'myproject.my_stats.MyDownloaderStats': 850,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': None,
    }
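
An alternative to copying and editing stats.py is to subclass the stock middleware. A minimal sketch of what my_stats.py could contain (the module name and the my_id meta key are made-up names for illustration; each request would need meta={'my_id': ...} set when it is created):

# my_stats.py
from scrapy.downloadermiddlewares.stats import DownloaderStats


class MyDownloaderStats(DownloaderStats):
    def process_exception(self, request, exception, spider):
        # keep the stock downloader/exception_* counters
        super().process_exception(request, exception, spider)
        ex_class = "%s.%s" % (exception.__class__.__module__, exception.__class__.__name__)
        # record which request (identified by its meta id) hit which exception type
        self.stats.set_value(
            'downloader/my_errs/{0}'.format(request.meta.get('my_id', request.url)),
            ex_class, spider=spider)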

The output at the end of the run looks like this (here using meta info where each request url is mapped to a group_id and member_id separated by a slash, like '0/14'):

{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web.http.PotentialDataLoss': 3,
 'downloader/my_errs/0/1': 'twisted.web.http.PotentialDataLoss',
 'downloader/my_errs/0/38': 'twisted.web.http.PotentialDataLoss',
 'downloader/my_errs/0/86': 'twisted.web.http.PotentialDataLoss',
 'downloader/request_bytes': 47583,
 'downloader/request_count': 133,
 'downloader/request_method_count/GET': 133,
 'downloader/response_bytes': 3416996,
 'downloader/response_count': 130,
 'downloader/response_status_count/200': 95,
 'downloader/response_status_count/301': 24,
 'downloader/response_status_count/302': 8,
 'downloader/response_status_count/500': 3,
 'finish_reason': 'finished'....}

This answer deals with non-downloader-based errors.

Answered By: scharfmn

As of Scrapy 0.24.6, the method suggested by alecxe won’t catch errors with the start URLs. To record errors with the start URLs you need to override parse_start_url. Adapting alecxe’s answer for this purpose, you’d get:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item, Field

class GitHubLinkItem(Item):
    url = Field()
    referer = Field()
    status = Field()

class GithubHelpSpider(CrawlSpider):
    name = "github_help"
    allowed_domains = ["help.github.com"]
    start_urls = ["https://help.github.com", ]
    handle_httpstatus_list = [404]
    rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)

    def parse_start_url(self, response):
        return self.handle_response(response)

    def parse_item(self, response):
        return self.handle_response(response)

    def handle_response(self, response):
        if response.status == 404:
            item = GitHubLinkItem()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['status'] = response.status

            return item
Answered By: Louis

This is an update on this question. I ran into a similar problem and needed to use the Scrapy signals to call a function in my pipeline. I have edited @Talvalin’s code, but wanted to make an answer just for some more clarity.

Basically, you should add spider as an argument to handle_spider_closed.
You should also connect the dispatcher in __init__ so that the spider instance (self) can be passed to the handling method.

from scrapy.spider import Spider
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals

class MySpider(Spider):
    handle_httpstatus_list = [404] 
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://www.example.com/thisurlexists.html',
        'http://www.example.com/thisurldoesnotexist.html',
        'http://www.example.com/neitherdoesthisone.html'
    ]

    def __init__(self, category=None):
        self.failed_urls = []
        # the dispatcher is now called in init
        dispatcher.connect(self.handle_spider_closed, signals.spider_closed)


    def parse(self, response):
        if response.status == 404:
            self.crawler.stats.inc_value('failed_url_count')
            self.failed_urls.append(response.url)

    def handle_spider_closed(self, spider, reason):  # added the spider argument
        self.crawler.stats.set_value('failed_urls', ','.join(spider.failed_urls))

    def process_exception(self, response, exception, spider):
        ex_class = "%s.%s" % (exception.__class__.__module__,  exception.__class__.__name__)
        self.crawler.stats.inc_value('downloader/exception_count', spider=spider)
        self.crawler.stats.inc_value('downloader/exception_type_count/%s' % ex_class, spider=spider)
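
For completeness, a minimal sketch of a pipeline that uses the collected list (the class name is hypothetical); it simply reads the spider's failed_urls when the spider closes:

class FailedUrlsPipeline:
    """Hypothetical pipeline that reports whatever the spider collected in failed_urls."""

    def process_item(self, item, spider):
        return item

    def close_spider(self, spider):
        failed = getattr(spider, 'failed_urls', [])
        spider.logger.info('%d failed URLs: %s', len(failed), ', '.join(failed))

Enable it via ITEM_PIPELINES in settings.py as usual.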

I hope this helps anyone with the same problem in the future.

Answered By: Mattias

Scrapy ignores 404 by default and does not parse it. If you are getting error code 404 in the response, you can handle it in a very easy way.

In settings.py, write:

HTTPERROR_ALLOWED_CODES = [404, 403]

And then handle the response status code in your parse function:

def parse(self, response):
    if response.status == 404:
        pass  # your action on error
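
For example, a slightly fuller sketch (the spider name and URL are placeholders) that records the allowed error statuses in the crawl stats:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://www.example.com/thisurldoesnotexist.html"]

    def parse(self, response):
        # with HTTPERROR_ALLOWED_CODES = [404, 403] these responses reach parse()
        if response.status in (403, 404):
            self.crawler.stats.inc_value('failed_url_count')
            self.logger.warning('Failed URL (%d): %s', response.status, response.url)
            return
        # normal parsing logic would go here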
Answered By: Pythonsguru

In addition to some of these answers, if you want to track Twisted errors, I would take a look at using the Request object’s errback parameter, on which you can set a callback function to be called with the Twisted Failure when a request fails.
In addition to the URL, this approach lets you track the type of failure.

You can then log the URLs using failure.request.url (where failure is the Twisted Failure object passed into the errback).

# these would be in a Spider
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse,
                             errback=self.handle_error)

def handle_error(self, failure):
    url = failure.request.url
    logging.error('Failure type: %s, URL: %s', failure.type, url)

The Scrapy docs give a full example of how this can be done, except that the calls to the Scrapy logger are now deprecated, so I’ve adapted my example to use Python’s built-in logging:

https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-errbacks
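
For reference, a rough sketch along the lines of that docs example, distinguishing the common failure types (the spider name and URL here are placeholders):

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TCPTimedOutError, TimeoutError


class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = ["http://www.example.com/thisurldoesnotexist.html"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

    def parse(self, response):
        self.logger.info('Got successful response from %s', response.url)

    def handle_error(self, failure):
        # failure.check() returns the matching exception class, or None
        if failure.check(HttpError):
            # the original response is attached to the HttpError
            self.logger.error('HttpError on %s', failure.value.response.url)
        elif failure.check(DNSLookupError):
            self.logger.error('DNSLookupError on %s', failure.request.url)
        elif failure.check(TimeoutError, TCPTimedOutError):
            self.logger.error('TimeoutError on %s', failure.request.url)
        else:
            self.logger.error('Other failure on %s', failure.request.url)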

Answered By: Michael

You can capture failed URLs in two ways.

  1. Define the Scrapy request with an errback

    class TestSpider(scrapy.Spider):
        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse, errback=self.errback)

        def errback(self, failure):
            '''handle failed url (failure.request.url)'''
            pass
    
  2. Use signals.request_dropped

    class TestSpider(scrapy.Spider):
        # requires: from scrapy import signals
        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            crawler.signals.connect(spider.request_dropped, signal=signals.request_dropped)
            return spider

        def request_dropped(self, request, spider):
            '''handle failed url (request.url)'''
            pass
    

Notice: a Scrapy request with errback cannot catch some auto-retry failures, like connection errors or the response codes listed in RETRY_HTTP_CODES in settings.

Answered By: jdxin0

Basically, Scrapy ignores 404 errors by default; this is defined in the httperror middleware.

So, add HTTPERROR_ALLOW_ALL = True to your settings file.

After this you can access response.status in your parse function.

You can handle it like this.

def parse(self, response):
    if response.status == 404:
        print(response.status)
    else:
        pass  # do something
Answered By: Mohan B E