Scrapy – Silently drop an item

Question:

I am using Scrapy to crawl several websites, which may share redundant information.

For each page I scrape, I store its URL, title and HTML code in MongoDB.
To avoid duplicates in the database, I implemented a pipeline that checks whether a similar item is already stored. If so, I raise a DropItem exception.

My problem is that whenever I drop an item by raising a DropItem exception, Scrapy writes the entire content of the item to the log (stdout or file).
As I’m extracting the entire HTML code of each scraped page, a drop causes the whole HTML code to be written to the log.

How could I silently drop an item without its content being shown?

Thank you for your time!

class DatabaseStorage(object):
    """ Pipeline in charge of database storage.

    The 'whole' item (with HTML and text) will be stored in mongoDB.
    """

    def __init__(self):
        self.mongo = MongoConnector().collection

    def process_item(self, item, spider):
        """ Method in charge of item validation and processing. """
        if item['html'] and item['title'] and item['url']:
            # insert item in mongo if not already present
            if self.mongo.find_one({'title': item['title']}):
                raise DropItem('Item already in db')
            else:
                self.mongo.insert(dict(item))
                log.msg("Item %s scraped" % item['title'],
                    level=log.INFO, spider=spider)
        else:
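            # Parenthesize the fallback: '+' binds tighter than 'or', so the
            # original expression never reached item.get('title').
            raise DropItem('Missing information on item scraped from %s' % (
                item.get('url') or item.get('title')))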
        return item
Asked By: Balthazar Rouberol


Answers:

Ok, I found the answer before even posting the question.
I still think that the answer might be valuable to anyone having the same problem.

Instead of dropping the object with a DropItem exception, you just have to return a None value:

def process_item(self, item, spider):
    """ Method in charge of item validation and processing. """
    if item['html'] and item['title'] and item['url']:
        # insert item in mongo if not already present
        if self.mongo.find_one({'url': item['url']}):
            return
        else:
            self.mongo.insert(dict(item))
            log.msg("Item %s scraped" % item['title'],
                level=log.INFO, spider=spider)
    else:
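        # Parenthesize the fallback: '+' binds tighter than 'or', so the
        # original expression never reached item.get('title').
        raise DropItem('Missing information on item scraped from %s' % (
            item.get('url') or item.get('title')))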
    return item
Answered By: Balthazar Rouberol

The proper way to do this looks to be to implement a custom LogFormatter for your project, and change the logging level of dropped items.

Example:

from scrapy import log
from scrapy import logformatter

class PoliteLogFormatter(logformatter.LogFormatter):
    def dropped(self, item, exception, response, spider):
        return {
            'level': log.DEBUG,
            'format': logformatter.DROPPEDFMT,
            'exception': exception,
            'item': item,
        }

Then in your settings file, something like:

LOG_FORMATTER = 'apps.crawler.spiders.PoliteLogFormatter'

I had bad luck just returning “None,” which caused exceptions in future pipelines.
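
To illustrate why returning None can break later pipelines, here is a simplified model in plain Python. It is only a sketch: `run_chain`, `DedupPipeline` and `TitlePipeline` are made-up names, and the loop merely assumes that each stage receives whatever the previous stage returned, which is not Scrapy's actual machinery.

```python
class DedupPipeline:
    """Hypothetical stage that returns None for duplicates instead of raising."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider=None):
        if item["url"] in self.seen_urls:
            return None  # "silently" drops the duplicate
        self.seen_urls.add(item["url"])
        return item


class TitlePipeline:
    """Hypothetical later stage that assumes it always receives a real item."""

    def process_item(self, item, spider=None):
        item["title"] = item["title"].strip()
        return item


def run_chain(item, pipelines):
    # Assumption: each stage is handed the previous stage's return value.
    for pipeline in pipelines:
        item = pipeline.process_item(item)
    return item


pipelines = [DedupPipeline(), TitlePipeline()]

# First occurrence passes through both stages.
first = run_chain({"url": "http://example.com", "title": " Home "}, pipelines)

# Second occurrence: DedupPipeline returns None, and TitlePipeline then
# fails when it tries to index None.
try:
    run_chain({"url": "http://example.com", "title": " Home "}, pipelines)
    error = None
except TypeError as exc:
    error = exc
```

If you do return None, every later pipeline would need an explicit `if item is None` guard, which is why changing how the drop is logged tends to be less fragile.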

Answered By: jimmytheleaf

In recent Scrapy versions, this has been changed a bit. I copied the code from @jimmytheleaf and fixed it to work with recent Scrapy:

import logging
from scrapy import logformatter


class PoliteLogFormatter(logformatter.LogFormatter):
    def dropped(self, item, exception, response, spider):
        return {
            'level': logging.INFO,
            'msg': logformatter.DROPPEDMSG,
            'args': {
                'exception': exception,
                'item': item,
            }
        }
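
# Note: whether the message is hidden depends on how the returned level
# compares with your LOG_LEVEL setting. The sketch below uses only the
# standard logging module to show that effect. DROPPED_MSG, ListHandler
# and dropped() are stand-ins made up for this illustration, not Scrapy
# objects.

```python
import logging

# Stand-in for the dropped-item message template (an assumption).
DROPPED_MSG = "Dropped: %(exception)s\n%(item)s"


class ListHandler(logging.Handler):
    """Collects formatted messages in a list so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def dropped(item, exception, level):
    # Same dictionary shape as the formatter above: level, msg, args.
    return {
        "level": level,
        "msg": DROPPED_MSG,
        "args": {"exception": exception, "item": item},
    }


logger = logging.getLogger("dropped_item_demo")
logger.setLevel(logging.INFO)  # plays the role of LOG_LEVEL = 'INFO'
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)

item = {"url": "http://example.com", "html": "<html>long page</html>"}

for level in (logging.INFO, logging.DEBUG):
    kwargs = dropped(item, "Item already in db", level)
    logger.log(kwargs["level"], kwargs["msg"], kwargs["args"])

# Only the INFO-level message reaches the handler; the DEBUG-level one is
# filtered out because it is below the logger's threshold.
```

# In other words, returning logging.DEBUG hides dropped items as long as
# the crawl runs at INFO or higher, while logging.INFO only hides them
# when the crawl runs at WARNING or higher.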
Answered By: mirosval

Another solution to this problem is to override the __repr__ method in your scrapy.Item subclass:

class SomeItem(scrapy.Item):
    scrape_date = scrapy.Field()
    spider_name = scrapy.Field()
    ...

    def __repr__(self):
        return ""

This way the item will not show up at all in the logs.
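
A rough illustration of why this works, using a plain dict subclass so that it runs without Scrapy: the sketch assumes the dropped-item message interpolates the item with `%s`, which falls back to `__repr__` when no `__str__` is defined. `QuietItem` and `DROPPED_MSG` are made-up stand-ins.

```python
# Stand-in for the dropped-item message template (an assumption).
DROPPED_MSG = "Dropped: %(exception)s\n%(item)s"


class QuietItem(dict):
    """Hypothetical item whose textual representation is empty."""

    def __repr__(self):
        return ""


big_html = "<html>" + "x" * 1000 + "</html>"

plain = {"url": "http://example.com", "html": big_html}
quiet = QuietItem(url="http://example.com", html=big_html)

# A regular dict is rendered in full, HTML included.
plain_line = DROPPED_MSG % {"exception": "Item already in db", "item": plain}

# The quiet item contributes nothing after the exception text.
quiet_line = DROPPED_MSG % {"exception": "Item already in db", "item": quiet}
```

The trade-off is that the item also prints as an empty string everywhere else, for example in the normal "Scraped" debug output or while debugging in a shell.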

Answered By: Levon

As Levon indicates in his answer above, it is also possible to override the __repr__ method of the Item you are processing.

This way, the message is still displayed in the Scrapy log, but you control how much of the code is shown, for example only the first 100 characters of the web page.
Assuming that you have an Item that represents an HTML page, the __repr__ override could look like the following:

class MyHTMLItem(scrapy.Item):
    url = scrapy.Field()
    htmlcode = scrapy.Field()
    [...]

    def __repr__(self):
        # Show the URL and only the first 100 characters of the HTML.
        s = ""
        s += "URL: %s\n" % self.get('url')
        s += "Code (chunk): %s\n" % (self.get('htmlcode') or '')[:100]
        return s
Answered By: Felipower

For me it was necessary to wrap the item in an ItemAdapter, which gives dict-like access to its fields, so that I could query the database.

from itemadapter import ItemAdapter
import pymongo
from scrapy.exceptions import DropItem  # only needed if the raise below is re-enabled


class MongoPipeline:
    # The class line was missing from the snippet; the name is a placeholder.
    collection_name = 'myCollection'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        collection = self.db[self.collection_name]
        # Look the item up once by its 'id' field. find_one_and_update()
        # requires an update document, which the original call did not
        # supply, so the stored document is only read here.
        dado = collection.find_one({'id': adapter['id']})
        if dado is not None:
            ## ----> raise DropItem(f"Duplicate item found: {item!r}") <------
            print(f"Duplicate item found: {dado!r}")
        else:
            collection.insert_one(adapter.asdict())
        return item
Answered By: Ericksan Pimentel