Scrapy Crawl URLs in Order

Question:

So, my problem is relatively simple. I have one spider crawling multiple sites, and I need it to return the data in the order I list the URLs in my code. The spider is posted below.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]
    start_urls = [
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
        items = []
        for site in sites:
            item = MlboddsItem()
            item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()# | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
            item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
            items.append(item)
        return items

The results come back in a seemingly random order; for example, it returns the 29th, then the 28th, then the 30th. I’ve tried changing the scheduler order from DFO to BFO, in case that was the problem, but that didn’t change anything.

Asked By: Jeff


Answers:

Disclaimer: I haven’t worked with Scrapy specifically.

The scraper may be queueing and re-queueing requests based on timeouts and HTTP errors, so it would be a lot easier if you could get the date from the response page itself.

That is, add another hxs.select statement that grabs the date (I just had a look, and it is definitely in the response data), add it to the item dict, and sort the items on that.

This is probably a more robust approach than relying on the order of the scrapes…
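
A minimal sketch of that idea (assumptions: MlboddsItem would need an extra date field, and here the date is pulled from the URL, which is simpler than selecting it from the page):

import re
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem  # assumed to define a 'date' field as well

class DateTaggedSpider(BaseSpider):
    name = "sbrforum.com.dated"
    allowed_domains = ["sbrforum.com"]
    start_urls = [
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for site in hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]'):
            item = MlboddsItem()
            # the YYYYMMDD date is embedded in the URL (and in the page itself)
            item['date'] = re.search(r'(\d{8})', response.url).group(1)
            item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()
            yield item

# After the crawl (or in an item pipeline), sort the collected items by that field:
# items.sort(key=lambda i: i['date'])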

Answered By: Jan Z

I believe the

hxs.select('...')

call will scrape the data from the site in the order it appears. Either that, or Scrapy is going through your start_urls in an arbitrary order. To force it to go through them in a predefined order (mind you, this won’t work if you need to crawl more sites), you can try this:

start_urls = ["url1.html"]

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('blah')
    items = []
    for site in sites:
        item = MlboddsItem()
        item['header'] = site.select('blah')
        item['game1'] = site.select('blah')
        items.append(item)
    # append the next Request (needs: from scrapy.http import Request), then
    # return the list; list.append() itself returns None, so don't return it directly
    items.append(Request('url2.html', callback=self.parse2))
    return items

then write a parse2 that does the same thing but appends a Request for url3.html with callback=self.parse3. This is horrible coding style, but I’m just throwing it out in case you need a quick hack.

Answered By: emish

I doubt it’s possible to achieve what you want unless you play with Scrapy internals. There are some similar discussions on the scrapy-users Google group, e.g.

http://groups.google.com/group/scrapy-users/browse_thread/thread/25da0a888ac19a9/1f72594b6db059f4?lnk=gst

One thing that can also help is setting CONCURRENT_REQUESTS_PER_SPIDER to 1, but it won’t completely ensure the order either, because the downloader has its own local queue for performance reasons; the best you can do is prioritize the requests, not guarantee their exact order.
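
For reference, that would be a one-line change in settings.py (this setting existed in Scrapy versions of that era; later releases replaced it with CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN):

# settings.py (older Scrapy versions)
CONCURRENT_REQUESTS_PER_SPIDER = 1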

Answered By: user

start_urls defines the URLs that are used in the start_requests method. Your parse method is called with a response for each start URL once its page has been downloaded, but you cannot control loading times: the first start URL might be the last to arrive at parse.

One solution: override the start_requests method and add a meta with a priority key to the generated requests, as in the sketch below. In parse, extract this priority value and add it to the item; in a pipeline, do something based on that value. (I don’t know why and where you need these URLs to be processed in this order.)

Or make it kind of synchronous: store these start URLs somewhere, put only the first of them in start_urls, process the first response in parse and yield the item(s), then take the next URL from your storage and make a request for it with parse as the callback.
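
A minimal sketch of the first idea, using the old-style API from the question (assumptions: MlboddsItem would need an extra priority field, and the sorting itself happens in a pipeline or whatever consumes the items):

from scrapy.http import Request
from scrapy.spider import BaseSpider
from mlbodds.items import MlboddsItem

class OrderedMetaSpider(BaseSpider):
    name = "ordered-meta"
    allowed_domains = ["sbrforum.com"]
    start_urls = [
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/",
    ]

    def start_requests(self):
        for index, url in enumerate(self.start_urls):
            # remember each URL's position so the items can be sorted on it later
            yield Request(url, meta={'priority': index})

    def parse(self, response):
        item = MlboddsItem()
        item['priority'] = response.meta['priority']  # assumes a 'priority' field exists
        # ... fill in header/game1 as in the question, then yield;
        # a pipeline can collect the items and order them by item['priority']
        yield item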

Answered By: warvariuc

The Google group discussion suggests using a priority attribute in the Request object. Scrapy guarantees the URLs are crawled in DFO by default, but it does not ensure that the URLs are visited in the order they were yielded within your parse callback.

Instead of yielding Request objects, you want to return a list of Requests from which objects will be popped until it is empty.

Can you try something like this?

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]

    def start_requests(self):
        start_urls = reversed([
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
        ])

        return [Request(url=start_url) for start_url in start_urls]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
        items = []
        for site in sites:
            item = MlboddsItem()
            item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()# | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
            item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
            items.append(item)
        return items

Answered By: Alexis

Of course you can control it. The secret is in how you feed the greedy engine/scheduler, and your requirement is only a small one. Notice that I add a list named “task_urls”.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from dirbot.items import Website

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["sbrforum.com"]
    start_urls = [
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
    ]
    task_urls = [
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
        items = []
        for site in sites:
            item = Website()
            item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()# | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
            item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
            items.append(item)
        # here we feed in the next request once the current URL has been handled
        self.task_urls.remove(response.url)
        if self.task_urls:
            r = Request(url=self.task_urls[0], callback=self.parse)
            items.append(r)

        return items

If you want a more complicated example, please see my project:
https://github.com/wuliang/TiebaPostGrabber

Answered By: wuliang

This solution is sequential, and it is similar to @wuliang’s.

I started with @Alexis de Tréglodé’s method but ran into a problem:
the fact that your start_requests() method returns a list of requests,
return [Request(url=start_url) for start_url in start_urls]
is causing the output to be non-sequential (asynchronous).

If the return is a single request, then creating an alternative other_urls list can fulfill the requirement. other_urls can also be used to add in URLs scraped from other webpages.

from scrapy import log
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from practice.items import MlboddsItem

log.start()

class PracticeSpider(BaseSpider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]

    other_urls = [
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/",
    ]

    def start_requests(self):
        log.msg('Starting Crawl!', level=log.INFO)
        start_urls = "http://www.sbrforum.com/mlb-baseball/odds-scores/20110327/"
        return [Request(start_urls, meta={'items': []})]

    def parse(self, response):
        log.msg("Begin Parsing", level=log.INFO)
        log.msg("Response from: %s" % response.url, level=log.INFO)
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//*[@id='moduleData8460']")
        items = response.meta['items']
        for site in sites:
            item = MlboddsItem()
            item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()
            item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text()').extract()
            items.append(item)

        # here we .pop(0) the next URL in line
        if self.other_urls:
            return Request(self.other_urls.pop(0), meta={'items': items})

        return items

Answered By: user1460015

Scrapy Request has a priority attribute now.

If you have many Requests in a function and want to process a particular request first, you can set:

def parse(self, response):
    url = 'http://www.example.com/first'
    yield Request(url=url, callback=self.parse_data, priority=1)

    url = 'http://www.example.com/second'
    yield Request(url=url, callback=self.parse_data)

Scrapy will process the one with priority=1 first.

Answered By: Sandeep Balagopal

Personally, I like @user1460015’s implementation, though I ended up with my own workaround solution.

My solution is to use Python’s subprocess module to run Scrapy URL by URL until all URLs have been taken care of.

In my code, if the user does not specify that they want to parse the URLs sequentially, we can start the spider in the normal way:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})
process.crawl(Spider, url=args.url)
process.start()

If a user specifies it needs to be done sequentially, we can do this:

import subprocess

for url in urls:
    # run one spider at a time and wait for it to finish before starting the next
    process = subprocess.Popen(['scrapy', 'runspider', 'scrapper.py',
                                '-a', 'url=' + url, '-o', outputfile])
    process.wait()

Note that this implementation does not handle errors.

Answered By: Leon Hu

There is a much easier way to make Scrapy follow the order of start_urls: just uncomment the concurrent-requests setting in settings.py and change it to 1.

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1

Answered By: Higor Sigaki

Most answers suggest passing the URLs one by one or limiting the concurrency to 1, which will slow you down significantly if you’re scraping many URLs.

When I faced this same problem, my solution was to use the callback arguments (cb_kwargs) to store the scraped data from all the URLs, sort it using the order of the initial URLs, and then return all the scraped data in order at once, something like this:

import scrapy

class MLBoddsSpider(scrapy.Spider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]
    to_scrape_urls = [
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
    ]

    def start_requests(self):
        # all requests share the same dict (passed under a single cb_kwargs key),
        # so each callback sees what the other callbacks have already collected
        data = {}
        for url in self.to_scrape_urls:
            yield scrapy.Request(url, self.parse, cb_kwargs={'data': data})

    def parse(self, response, data):
        # scrape the data and store it under this response's URL
        data[response.url] = response.css('myData').get()

        # check if all urls have been scraped yet
        if len(data) == len(self.to_scrape_urls):
            # return a list of your data, sorted by the order of the initial urls
            return [data[url] for url in self.to_scrape_urls]

Answered By: AMasrar

I know this is an old question, but I struggled with this problem today and was not completely satisfied with the solutions I found in this thread. Here’s how I handled it.

the spider:

import scrapy

class MySpider(scrapy.Spider):

    name = "mySpider"

    start_urls = None

    def __init__(self, urls, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = urls

    def parse(self, response):
        # your parsing code goes here
        pass

and the spider runner:

from twisted.internet import reactor, defer
import spiders.mySpider as ms
from scrapy.crawler import CrawlerRunner

urls = [
    'http://address1.com',
    'http://address2.com',
    'http://address3.com'
]

runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    for url in urls:
        yield runner.crawl(ms.MySpider, urls = [url])
    reactor.stop()

crawl()
reactor.run()

This code calls the spider with one URL from the list passed as a parameter, and waits until that crawl has finished before calling the spider again with the next URL.

Answered By: McHat

Add this to your settings; it switches the scheduler’s queues from the default LIFO to FIFO, so requests are dispatched in the order they are enqueued:

SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

Answered By: bugMaker