How to save the data from a scrapy crawler into a variable?

Question:

I’m currently building a web app meant to display the data collected by a Scrapy spider. The user makes a request, the spider crawls a website, and the data is then returned to the app so it can be displayed. I’d like to retrieve the data directly from the scraper, without relying on an intermediary .csv or .json file. Something like:

from scrapy.crawler import CrawlerProcess
from scraper.spiders import MySpider

url = 'https://www.example.com'
spider = MySpider()
crawler = CrawlerProcess()
crawler.crawl(spider, start_urls=[url])
crawler.start()
data = crawler.data # this bit
Asked By: Crolle


Answers:

This is not so easy, because Scrapy is non-blocking and works in an event loop; it uses the Twisted event loop, and the Twisted reactor is not restartable, so you can’t write crawler.start(); data = crawler.data – after crawler.start() the process runs forever, calling registered callbacks, until it is killed or ends.
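For a plain blocking script (i.e. your app does not run its own event loop), one commonly used workaround is to connect a listener to the item_scraped signal and collect the items into an in-memory list before the blocking start() call. A minimal sketch, assuming a placeholder MySpider and URL:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scraper.spiders import MySpider  # placeholder: your own spider class

items = []

def collect_item(item, response, spider):
    # called once for every item the spider yields
    items.append(item)

process = CrawlerProcess()
crawler = process.create_crawler(MySpider)
crawler.signals.connect(collect_item, signal=signals.item_scraped)
process.crawl(crawler, start_urls=['https://www.example.com'])
process.start()  # blocks until the crawl is finished
print(items)     # all scraped items are now available in memory

This keeps everything in one process, at the cost of blocking until the crawl ends; the reactor still cannot be restarted afterwards.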


If you use an event loop in your app (e.g. you have a Twisted or Tornado web server), then it is possible to get the data from a crawl without storing it to disk. The idea is to listen to the item_scraped signal. I’m using the following helper to make it nicer:

import collections

from twisted.internet.defer import Deferred
from scrapy.crawler import Crawler
from scrapy import signals

def scrape_items(crawler_runner, crawler_or_spidercls, *args, **kwargs):
    """
    Start a crawl and return an object (ItemCursor instance)
    which allows to retrieve scraped items and wait for items
    to become available.

    Example:

    .. code-block:: python

        @inlineCallbacks
        def f():
            runner = CrawlerRunner()
            async_items = scrape_items(runner, my_spider)
            while (yield async_items.fetch_next):
                item = async_items.next_item()
                # ...
            # ...

    This convoluted way to write a loop should become unnecessary
    in Python 3.5 because of ``async for``.
    """
    crawler = crawler_runner.create_crawler(crawler_or_spidercls)    
    d = crawler_runner.crawl(crawler, *args, **kwargs)
    return ItemCursor(d, crawler)


class ItemCursor(object):
    def __init__(self, crawl_d, crawler):
        self.crawl_d = crawl_d
        self.crawler = crawler

        crawler.signals.connect(self._on_item_scraped, signals.item_scraped)

        crawl_d.addCallback(self._on_finished)
        crawl_d.addErrback(self._on_error)

        self.closed = False
        self._items_available = Deferred()
        self._items = collections.deque()

    def _on_item_scraped(self, item):
        self._items.append(item)
        self._items_available.callback(True)
        self._items_available = Deferred()

    def _on_finished(self, result):
        self.closed = True
        self._items_available.callback(False)

    def _on_error(self, failure):
        self.closed = True
        self._items_available.errback(failure)

    @property
    def fetch_next(self):
        """
        A Deferred used with ``inlineCallbacks`` or ``gen.coroutine`` to
        asynchronously retrieve the next item, waiting for an item to be
        crawled if necessary. Resolves to ``False`` if the crawl is finished,
        otherwise :meth:`next_item` is guaranteed to return an item
        (a dict or a scrapy.Item instance).
        """
        if self.closed:
            # crawl is finished
            d = Deferred()
            d.callback(False)
            return d

        if self._items:
            # result is ready
            d = Deferred()
            d.callback(True)
            return d

        # We're active, but item is not ready yet. Return a Deferred which
        # resolves to True if item is scraped or to False if crawl is stopped.
        return self._items_available

    def next_item(self):
        """Get a document from the most recently fetched batch, or ``None``.
        See :attr:`fetch_next`.
        """
        if not self._items:
            return None
        return self._items.popleft()

The API is inspired by motor, a MongoDB driver for async frameworks. Using scrape_items you can get items from Twisted or Tornado callbacks as soon as they are scraped, in a way similar to how you fetch items from a MongoDB query.
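For example, a minimal sketch (MySpider is a placeholder for your own spider class) of driving scrape_items from plain Twisted code with a CrawlerRunner and collecting every item into a list:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner

@defer.inlineCallbacks
def collect_all(spider_cls, **kwargs):
    # run one crawl and gather every scraped item into a list
    runner = CrawlerRunner()
    cursor = scrape_items(runner, spider_cls, **kwargs)
    items = []
    while (yield cursor.fetch_next):
        items.append(cursor.next_item())
    return items

d = collect_all(MySpider, start_urls=['https://www.example.com'])
d.addCallback(print)
d.addBoth(lambda _: reactor.stop())
reactor.run()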

Answered By: Mikhail Korobov

You can pass a variable as an attribute of the spider class and store the data in it.

Of course, you need to add the attribute in the __init__ method of your spider class.

from scrapy.crawler import CrawlerProcess
from scraper.spiders import MySpider

url = 'https://www.example.com'
spider = MySpider()
crawler = CrawlerProcess()
data = []
crawler.crawl(spider, start_urls=[url], data=data)
crawler.start()
print(data)
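A minimal sketch of what the spider side of this approach might look like (the field names are illustrative):

from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'

    def __init__(self, data=None, **kwargs):
        super().__init__(**kwargs)
        # the caller passes in a list; the spider appends scraped items to it
        self.data = data if data is not None else []

    def parse(self, response):
        item = {'url': response.url, 'title': response.css('title::text').get()}
        self.data.append(item)
        yield item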
Answered By: hussein13

This is probably too late, but it may help others: you can pass a callback function to the spider and call that function to return your data, like so:

The dummy spider that we are going to use:

from scrapy import Spider


class Trial(Spider):
    name = 'trial'

    start_urls = ['']

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.output_callback = kwargs.get('args').get('callback')

    def parse(self, response):
        pass

    def close(self, spider, reason):
        # called when the spider finishes; hand the collected output back
        self.output_callback(['Hi, This is the output.'])

A custom class with the callback:

from scrapy.crawler import CrawlerProcess
from scrapyapp.spiders.trial_spider import Trial


class CustomCrawler:

    def __init__(self):
        self.output = None
        self.process = CrawlerProcess(settings={'LOG_ENABLED': False})

    def yield_output(self, data):
        self.output = data

    def crawl(self, cls):
        self.process.crawl(cls, args={'callback': self.yield_output})
        self.process.start()


def crawl_static(cls):
    crawler = CustomCrawler()
    crawler.crawl(cls)
    return crawler.output

Then you can do:

out = crawl_static(Trial)
print(out)
Answered By: Siddhant

My answer is inspired by Siddhant’s answer:

from scrapy import Spider


class MySpider(Spider):

    name = 'myspider'

    def parse(self, response):
        item = {
            'url': response.url,
            'status': response.status
        }
        # output_callback is passed in through process.crawl() below and
        # becomes a spider attribute
        yield self.output_callback(item)  # instead of yield item


from scrapy.crawler import CrawlerProcess


class Crawler:

    def __init__(self):
        self.process = CrawlerProcess()
        self.scraped_items = []

    def process_item(self, item): # similar to process_item in pipeline
        item['scraped'] = 'yes'
        self.scraped_items.append(item)
        return item

    def spawn(self, **kwargs):
        self.process.crawl(MySpider,
                           output_callback=self.process_item,
                           **kwargs)

    def run(self):
        self.process.start()


if __name__ == '__main__':
    crawler = Crawler()
    crawler.spawn(start_urls=['https://www.example.com', 'https://www.google.com'])
    crawler.run()

    print(crawler.scraped_items)

Output

[{'url': 'https://www.google.com', 'status': 200, 'scraped': 'yes'},
 {'url': 'https://www.example.com', 'status': 200, 'scraped': 'yes'}]

process_item is very useful for processing items as well as storing them.

Answered By: rish_hyun