suppress Scrapy Item printed in logs after pipeline

Question:

I have a Scrapy project where the item that ultimately enters my pipeline is relatively large and stores lots of metadata and content. Everything is working properly in my spider and pipelines. The logs, however, are printing out the entire Scrapy Item as it leaves the pipeline (I believe):

2013-01-17 18:42:17-0600 [tutorial] DEBUG: processing Pipeline pipeline module
2013-01-17 18:42:17-0600 [tutorial] DEBUG: Scraped from <200 http://www.example.com>
    {'attr1': 'value1',
     'attr2': 'value2',
     'attr3': 'value3',
     ...
     snip
     ...
     'attrN': 'valueN'}
2013-01-17 18:42:18-0600 [tutorial] INFO: Closing spider (finished)

I would rather not have all this data puked into log files if I can avoid it. Any suggestions about how to suppress this output?

Asked By: dino


Answers:

Having read through the documentation and conducted a (brief) search through the source code, I can’t see a straightforward way of achieving this aim.

The hammer approach is to set the logging level in the settings to INFO (i.e. add the following line to settings.py):

LOG_LEVEL='INFO'

This will strip out a lot of other information about the URLs/page that are being crawled, but it will definitely suppress data about processed items.

Answered By: Talvalin

Or, if you know that the spider is working correctly, you can disable logging entirely:

LOG_ENABLED = False

I disable it once my crawler runs fine.

Answered By: Mirage

Another approach is to override the __repr__ method of the Item subclasses to selectively choose which attributes (if any) to print at the end of the pipeline:

from scrapy.item import Item, Field
class MyItem(Item):
    attr1 = Field()
    attr2 = Field()
    # ...
    attrN = Field()

    def __repr__(self):
        """Only print out attr1 after exiting the pipeline."""
        return repr({"attr1": self["attr1"]})

This way, you can keep the log level at DEBUG and show only the attributes that you want to see coming out of the pipeline (to check attr1, for example).

Answered By: dino

I tried the __repr__ approach mentioned by @dino, but it didn't work well for me. Building on his idea, I tried the __str__ method instead, and it works.

Here’s how I do it, very simple:

    def __str__(self):
        return ""
Answered By: KurtRao

If you want to exclude only some attributes from the output, you can extend the answer given by @dino:

from scrapy.item import Item, Field
import json

class MyItem(Item):
    attr1 = Field()
    attr2 = Field()
    attr1ToExclude = Field()
    attr2ToExclude = Field()
    # ...
    attrN = Field()

    def __repr__(self):
        r = {}
        for attr, value in self.items():
            if attr not in ['attr1ToExclude', 'attr2ToExclude']:
                r[attr] = value
        return json.dumps(r, sort_keys=True, indent=4, separators=(',', ': '))
Answered By: mperrin

I think the cleanest way to do this is to add a filter to the scrapy.core.scraper logger that changes the message in question. This allows you to keep your Item's __repr__ intact without having to change Scrapy's logging level:

import logging
import re

class ItemMessageFilter(logging.Filter):
    def filter(self, record):
        # The message that logs the item actually has raw % operators in it,
        # which Scrapy presumably formats later on
        match = re.search(r'(Scraped from %\(src\)s)\n%\(item\)s', record.msg)
        if match:
            # Make the message everything but the item itself
            record.msg = match.group(1)
        # Don't actually want to filter out this record, so always return 1
        return 1

logging.getLogger('scrapy.core.scraper').addFilter(ItemMessageFilter())
Answered By: Charles Davis

If you found your way here because you had the same question years later, the easiest way to do this is with a LogFormatter:

import scrapy.logformatter

class QuietLogFormatter(scrapy.logformatter.LogFormatter):
    def scraped(self, item, response, spider):
        return (
            super().scraped(item, response, spider)
            if spider.settings.getbool("LOG_SCRAPED_ITEMS")
            else None
        )

Just add LOG_FORMATTER = "path.to.QuietLogFormatter" to your settings.py and you will see all your DEBUG messages except for the scraped items. With LOG_SCRAPED_ITEMS = True you can restore the previous behaviour without having to change your LOG_FORMATTER.
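
For example, if the formatter above lives in a (hypothetical) module myproject/logformatters.py, the corresponding settings.py entries might look like this:

# settings.py -- the module path is an assumption, point it at your own project
LOG_FORMATTER = "myproject.logformatters.QuietLogFormatter"

# Custom flag read by QuietLogFormatter; set to True to log scraped items again
LOG_SCRAPED_ITEMS = False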

Similarly you can customise the logging behaviour for crawled pages and dropped items.
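
As a rough sketch of that idea, a subclass could silence dropped-item messages the same way (LOG_DROPPED_ITEMS is a made-up setting name, analogous to LOG_SCRAPED_ITEMS above):

class QuieterLogFormatter(QuietLogFormatter):
    def dropped(self, item, exception, response, spider):
        # dropped() is the LogFormatter hook for items rejected by a pipeline;
        # returning None skips the log entry entirely
        return (
            super().dropped(item, exception, response, spider)
            if spider.settings.getbool("LOG_DROPPED_ITEMS")
            else None
        )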

Edit: I wrapped up this formatter and some other Scrapy stuff in this library.

Answered By: Markus Shepherd

We use the following sample in production:

import logging

logging.getLogger('scrapy.core.scraper').addFilter(
    lambda x: not x.getMessage().startswith('Scraped from'))

This is very simple, working code. We add it to the __init__.py of the module that contains our spiders, so it runs automatically for all spiders whenever we use a command like scrapy crawl <spider_name>.

Answered By: FreezemanDix

Create a filter:

import logging

class ItemFilter(logging.Filter):
    def filter(self, record):
        # Keep every record except the "Scraped from ..." item dumps
        return not record.msg.startswith('Scraped from')

Then add it in the __init__ of your spider:

import logging
import scrapy

class YourSpider(scrapy.Spider):
    name = "your_spider"

    def __init__(self, *args, **kwargs):
        super(YourSpider, self).__init__(*args, **kwargs)

        if int(getattr(self, "no_items_output", 0)):
            for handler in logging.root.handlers:
                handler.addFilter(ItemFilter())

You can then run it with scrapy crawl your_spider -a no_items_output=1

Answered By: Xoel