scrapy – parsing items that are paginated
Question:
I have a url of the form:
example.com/foo/bar/page_1.html
There are a total of 53 pages, each one of them has ~20 rows.
I basically want to get all the rows from all the pages, i.e. ~53*20 items.
I have working code in my parse method that parses a single page and also goes one page deeper per item, to get more info about the item:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')
    for rest in restaurants:
        item = DegustaItem()
        item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
        # some items don't have a category associated with them
        try:
            item['category'] = rest.select('td[3]/a/text()').extract()[0]
        except IndexError:
            item['category'] = ''
        item['urbanization'] = rest.select('td[4]/a/text()').extract()[0]
        # get profile url
        rel_url = rest.select('td[2]/a/@href').extract()[0]
        # join with base url since profile url is relative
        base_url = get_base_url(response)
        follow = urljoin_rfc(base_url, rel_url)
        request = Request(follow, callback=self.parse_profile)
        request.meta['item'] = item
        return request

def parse_profile(self, response):
    item = response.meta['item']
    # item['address'] = figure out xpath
    return item
The question is, how do I crawl each page?
example.com/foo/bar/page_1.html
example.com/foo/bar/page_2.html
example.com/foo/bar/page_3.html
...
...
...
example.com/foo/bar/page_53.html
Answers:
You have two options to solve your problem. The general one is to use yield to generate new requests instead of return. That way you can issue more than one new request from a single callback. Check the second example at http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example.
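Scrapy specifics aside, the core difference is that a callback using yield is a generator and can emit any number of values, while return emits at most one. A minimal, Scrapy-free sketch (the tuples stand in for Request objects):

```python
def callback_with_return(rows):
    # return exits on the first iteration: only one "request" ever leaves
    for row in rows:
        return ('request', row)

def callback_with_yield(rows):
    # yield makes this a generator: every iteration emits a "request"
    for row in rows:
        yield ('request', row)

rows = ['r1', 'r2', 'r3']
print(callback_with_return(rows))       # ('request', 'r1')
print(list(callback_with_yield(rows)))  # all three tuples
```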
In your case there is probably a simpler solution: just generate the list of start URLs from a pattern like this:
class MySpider(BaseSpider):
    start_urls = ['http://example.com/foo/bar/page_%s.html' % page for page in xrange(1, 54)]
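Note that xrange is Python 2 only; under Python 3 the same pattern uses range. A quick check that the comprehension covers all 53 pages:

```python
# Python 3 equivalent of the pattern above (range replaces Python 2's xrange)
start_urls = ['http://example.com/foo/bar/page_%s.html' % page
              for page in range(1, 54)]

print(len(start_urls))    # 53
print(start_urls[0])      # http://example.com/foo/bar/page_1.html
print(start_urls[-1])     # http://example.com/foo/bar/page_53.html
```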
You could use CrawlSpider instead of BaseSpider, and use SgmlLinkExtractor to extract the pagination links.
For instance:
start_urls = ["http://www.example.com/page1"]

rules = (
    Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@class="next_page"]',)),
         follow=True),
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="foto_imovel"]',)),
         callback='parse_call'),
)
The first rule tells Scrapy to follow the links matched by its XPath expression; the second tells Scrapy to call parse_call on the links matched by its XPath expression, in case you want to parse something on each page.
For more info please see the doc: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider
There can be two use cases for ‘scrapy – parsing items that are paginated’.
A) We just want to move across the table and fetch data. This is relatively straightforward.
class TrainSpider(scrapy.Spider):
    name = "trip"
    start_urls = ['somewebsite']

    def parse(self, response):
        '''do something with this parser'''
        next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
Observe the last four lines. Here:
- We get the next-page link from the 'Next' pagination button via its XPath.
- The if condition checks that we have not reached the end of the pagination.
- We join this (relative) link from step 1 with the main URL using urljoin.
- We make a recursive call to the parse callback method.
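The join in step 3 behaves like the standard library's urljoin; response.urljoin(href) is shorthand for joining href against response.url. A quick sketch with a hypothetical page URL:

```python
from urllib.parse import urljoin

# hypothetical current page; response.urljoin(href) == urljoin(response.url, href)
page_url = 'http://example.com/foo/bar/page_1.html'
next_href = 'page_2.html'            # relative link from the 'Next' button
print(urljoin(page_url, next_href))  # http://example.com/foo/bar/page_2.html
```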
B) We not only want to move across pages, we also want to extract data from one or more links on each page.
class StationDetailSpider(CrawlSpider):
    name = 'train'
    start_urls = [someOtherWebsite]

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[@class='next_page']"), follow=True),
        Rule(LinkExtractor(allow=r"/trains/\d+$"), callback='parse_trains'),
    )

    def parse_trains(self, response):
        '''do your parsing here'''
Here, observe that:
- We are using the CrawlSpider subclass of the scrapy.Spider parent class.
- We have set 'rules':
a) The first rule just checks whether a 'next_page' link is available and follows it.
b) The second rule requests all links on a page that match the format, say /trains/12343, and then calls parse_trains to perform the parsing.
- Important: we don't want to use the regular parse method here, because the CrawlSpider subclass defines its own parse method and we must not override it. Just remember to name your callback method something other than parse.
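Note that the allow pattern needs an escaped digit class (\d+, a run of digits), not a literal d+. A quick check with hypothetical URLs:

```python
import re

# the allow pattern from the rules above: a numeric id at the end of the path
pattern = re.compile(r"/trains/\d+$")

print(bool(pattern.search('http://example.com/trains/12343')))  # True
print(bool(pattern.search('http://example.com/trains/about')))  # False
```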