scrapy – parsing items that are paginated

Question:

I have a url of the form:

example.com/foo/bar/page_1.html

There are a total of 53 pages, and each one has ~20 rows.

I basically want to get all the rows from all the pages, i.e. ~53*20 items.

I have working code in my parse method that parses a single page and also goes one page deeper per item to get more info about the item:

  def parse(self, response):
    hxs = HtmlXPathSelector(response)

    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')

    for rest in restaurants:
      item = DegustaItem()
      item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
      # some items don't have category associated with them
      try:
        item['category'] = rest.select('td[3]/a/text()').extract()[0]
      except IndexError:
        item['category'] = ''
      item['urbanization'] = rest.select('td[4]/a/text()').extract()[0]

      # get profile url
      rel_url = rest.select('td[2]/a/@href').extract()[0]
      # join with base url since profile url is relative
      base_url = get_base_url(response)
      follow = urljoin_rfc(base_url,rel_url)

      request = Request(follow, callback=self.parse_profile)
      request.meta['item'] = item
      return request


  def parse_profile(self, response):
    item = response.meta['item']
    # item['address'] = figure out xpath
    return item

The question is, how do I crawl each page?

example.com/foo/bar/page_1.html
example.com/foo/bar/page_2.html
example.com/foo/bar/page_3.html
...
...
...
example.com/foo/bar/page_53.html
Asked By: AlexBrand


Answers:

You have two options to solve your problem. The general one is to use yield to generate new requests instead of return. That way you can issue more than one new request from a single callback. Check the second example at http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example.
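Applied to the code in the question, that means yielding every profile request from the loop and also yielding a request for the next page from the same callback. A rough sketch (it reuses the question's imports, item class and XPaths; the 'next' link XPath is a placeholder, since the real markup isn't shown):

  def parse(self, response):
    hxs = HtmlXPathSelector(response)
    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')

    for rest in restaurants:
      item = DegustaItem()
      item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
      # ... fill in the remaining fields exactly as in the question ...
      rel_url = rest.select('td[2]/a/@href').extract()[0]
      follow = urljoin_rfc(get_base_url(response), rel_url)
      request = Request(follow, callback=self.parse_profile)
      request.meta['item'] = item
      yield request  # yield, not return, so every row produces a request

    # also yield a request for the next page, if there is one
    # (placeholder XPath -- adjust it to the site's real 'next' link)
    next_page = hxs.select('//a[@class="next"]/@href').extract()
    if next_page:
      yield Request(urljoin_rfc(get_base_url(response), next_page[0]),
                    callback=self.parse)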

In your case there is probably a simpler solution: just generate the list of start URLs from a pattern like this:

class MySpider(BaseSpider):
    start_urls = ['http://example.com/foo/bar/page_%s.html' % page for page in xrange(1,54)]
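On Python 3 with a current Scrapy release, the same idea can be written with scrapy.Spider and range in place of BaseSpider and xrange (a minimal equivalent sketch):

import scrapy

class MySpider(scrapy.Spider):
    name = 'pages'
    start_urls = ['http://example.com/foo/bar/page_%s.html' % page for page in range(1, 54)]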
Answered By: Achim

You could use CrawlSpider instead of BaseSpider and use SgmlLinkExtractor to extract the pagination links.

For instance:

start_urls = ["www.example.com/page1"]
rules = ( Rule (SgmlLinkExtractor(restrict_xpaths=('//a[@class="next_page"]',))
                , follow= True),
          Rule (SgmlLinkExtractor(restrict_xpaths=('//div[@class="foto_imovel"]',))
                , callback='parse_call')
    )

The first rule tells Scrapy to follow the links matched by its XPath expression; the second rule tells Scrapy to call parse_call for the links matched by its XPath expression, in case you want to parse something on each page.
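Put together, a complete spider along those lines might look something like the sketch below (the imports are for the older Scrapy versions that still ship SgmlLinkExtractor; the class name, spider name and start URL are placeholders, while the XPaths and the parse_call callback come from the snippet above):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class PaginatedSpider(CrawlSpider):
    name = 'paginated'
    start_urls = ['http://www.example.com/page1']

    rules = (
        # keep following the 'next page' link
        Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@class="next_page"]',)), follow=True),
        # hand every item link found on a page to parse_call
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="foto_imovel"]',)), callback='parse_call'),
    )

    def parse_call(self, response):
        # extract whatever you need from the item page here
        pass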

For more info please see the doc: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider

Answered By: bslima

There can be two use cases for ‘scrapy – parsing items that are paginated’.

A) We just want to move across the table and fetch data. This is relatively straightforward.

import scrapy

class TrainSpider(scrapy.Spider):
    name = "trip"
    start_urls = ['somewebsite']
    def parse(self, response):
        ''' do something with this parser '''
        next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Observe the last four lines (a fuller sketch of the parse method follows this list):

  1. We get the next-page link from the XPath of the ‘Next’ pagination button.
  2. The if condition checks that we have not reached the end of the pagination.
  3. We join the (relative) link from step 1 with the main URL using response.urljoin.
  4. We make a recursive call to the same parse callback method.
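For instance, the placeholder body could yield one item per table row before following the pagination link. A sketch only -- the row and field XPaths are made up, since the real page isn't shown:

import scrapy

class TrainSpider(scrapy.Spider):
    name = "trip"
    start_urls = ['somewebsite']

    def parse(self, response):
        # yield one item per row of the table on the current page
        # (placeholder XPaths -- adjust them to the real markup)
        for row in response.xpath("//table//tr[position()>1]"):
            yield {'name': row.xpath("td[1]//text()").extract_first()}

        # then follow the 'Next' link, exactly as above
        next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)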

B) Not only do we want to move across pages, we also want to extract data from one or more links on each page.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class StationDetailSpider(CrawlSpider):
    name = 'train'
    start_urls = ['someOtherWebsite']
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[@class='next_page']"), follow=True),
        Rule(LinkExtractor(allow=r"/trains/\d+$"), callback='parse_trains'),
    )

    def parse_trains(self, response):
        '''do your parsing here'''

Here, observe that:

  1. We are using CrawlSpider, a subclass of scrapy.Spider.

  2. We have set two ‘Rules’:

    a) The first rule just checks whether a ‘next_page’ link is available and follows it.

    b) The second rule requests all the links on a page that match a given format, say /trains/12343, and then calls parse_trains to perform the parsing operation.

  3. Important: we don’t want to use the regular parse method here, because we are using the CrawlSpider subclass. CrawlSpider has its own parse method that drives the rules, so we must not override it. Just remember to name your callback method something other than parse.
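To illustrate point 3, the callback in the example above might end up looking something like this (the field name and XPath are placeholders):

    def parse_trains(self, response):
        # named parse_trains, not parse, so CrawlSpider's own parse method
        # (which drives the rules) stays untouched
        yield {
            # placeholder XPath -- replace it with the real train-detail markup
            'train_name': response.xpath("//h1/text()").extract_first(),
        }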

Answered By: Santosh Pillai