How can I make Scrapy follow the links in order

Question:

I’m working on a small scraping project and everything works, but I’m having a problem with the order of the links, since Scrapy is asynchronous. rankings["Men's Pound-for-Pound"] is a list of links that I expect to be followed in order, so that the output is in the same order as well.

Here’s my code:

import scrapy
from scrapy.selector import Selector


class FighterSpiderSpider(scrapy.Spider):
    name = 'fighter_spider'
    allowed_domains = ['www.ufc.com.br']
    start_urls = ['https://www.ufc.com.br/rankings']

    def parse(self, response):
        all_rankings = response.css('div.view-grouping').getall()  # list of all rankings
        # Ranking title -> first link in that ranking
        champions = {Selector(text=x).css('div.view-grouping div.info h4::text').get().strip(): Selector(text=x).css('a::attr(href)').get() for x in all_rankings}
        # Ranking title -> all links in that ranking, in page order
        rankings = {Selector(text=x).css('div.info h4::text').get().strip(): Selector(text=x).css('a::attr(href)').getall() for x in all_rankings}

        if self.ranking == "p4p male":
            for link in rankings["Men's Pound-for-Pound"]:
                yield response.follow(link, callback=self.parse_date)

Asked By: Samuel Martins

||

Answers:

There is no way to guarantee that the responses, and therefore the output, will be processed in a specific order. You can set a priority on each request, which influences the order in which requests are scheduled, but because downloads run concurrently and responses arrive as they complete, this does not guarantee that the responses are processed in the same order.

You can set the priority by passing the priority parameter to scrapy.Request or to response.follow.

links = rankings["Men's Pound-for-Pound"]
for i, link in enumerate(links):
    # Earlier links get a higher priority, so they are scheduled first.
    yield response.follow(link, callback=self.parse_date, priority=len(links) - i)

The higher the priority value, the sooner the request is scheduled.
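To make the priority arithmetic concrete, here is a minimal sketch in plain Python with no Scrapy involved. The helper name with_descending_priority and the example links are hypothetical, chosen only for illustration.

```python
def with_descending_priority(links):
    """Pair each link with a priority so that earlier links get higher values."""
    total = len(links)
    return [(link, total - i) for i, link in enumerate(links)]


# Hypothetical links, used only for illustration.
links = ["/athlete/a", "/athlete/b", "/athlete/c"]
for link, priority in with_descending_priority(links):
    print(priority, link)
```

The first link gets priority 3 and the last gets priority 1, so every value stays above the default request priority of 0. Each value could then be passed as priority=... to response.follow.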

Since this doesn’t guarantee the output order, though, I would suggest simply passing the rank with the request as a callback keyword argument (cb_kwargs) and then sorting the output in a pipeline or a post-processing step.

For example:

import scrapy
from scrapy.selector import Selector


class FighterSpiderSpider(scrapy.Spider):
    name = 'fighter_spider'
    allowed_domains = ['www.ufc.com.br']
    start_urls = ['https://www.ufc.com.br/rankings']

    def parse(self, response):
        all_rankings = response.css('div.view-grouping').getall()  # list of all rankings
        champions = {Selector(text=x).css('div.view-grouping div.info h4::text').get().strip(): Selector(text=x).css('a::attr(href)').get() for x in all_rankings}
        rankings = {Selector(text=x).css('div.info h4::text').get().strip(): Selector(text=x).css('a::attr(href)').getall() for x in all_rankings}

        if self.ranking == "p4p male":
            for i, link in enumerate(rankings["Men's Pound-for-Pound"]):
                # Pass the 1-based rank along to the callback.
                yield response.follow(link, callback=self.parse_date, cb_kwargs={"rank": i + 1})

    def parse_date(self, response, rank):
        ...
        ...
        yield {'rank': rank}  # ... other fields

Then you can sort the output into the correct order in a pipeline or in post-processing.
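As a rough sketch of the pipeline option, the class below buffers every item and writes them out sorted by rank once the spider closes. The class name SortByRankPipeline and the output file name are assumptions made for illustration, and the approach holds all items in memory, which should be fine for a short ranking list but not for a large crawl.

```python
import json


class SortByRankPipeline:
    """Buffer every item, then write them sorted by 'rank' when the spider closes."""

    def __init__(self, path="fighters_sorted.jsonl"):
        self.path = path  # hypothetical output file name
        self.items = []

    def open_spider(self, spider=None):
        # Start each crawl with an empty buffer.
        self.items = []

    def process_item(self, item, spider=None):
        self.items.append(dict(item))
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider=None):
        # Responses arrive in arbitrary order, so sort only once at the end.
        self.items.sort(key=lambda item: item["rank"])
        with open(self.path, "w", encoding="utf-8") as f:
            for item in self.items:
                f.write(json.dumps(item, ensure_ascii=False) + "\n")
```

To enable it, the class would be registered in the project's ITEM_PIPELINES setting, for example under a path such as myproject.pipelines.SortByRankPipeline (the module path is hypothetical and depends on your project layout).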

Answered By: Alexander