Scrapy requests – Callback function not being called in nested requests

Question

I am trying to scrape some products from Amazon in order to get some information on my competitors. This is the process I am following:

Make a query in the search bar ->
Visit every product page of the results gotten from the query -> 
Gather information from that product ->
Check if the product matches the quantity that we looked for (i.e. we might want to collect only products sold in a pack of n items, like a kit of n toner cartridges)
    -> If it does, yield the item.
    -> If not, find a variation in that ad that represents a pack of such n items
         -> If such a variation exists, go visit that variation of the product, modify some information of the item (such as price and asin) and yield that item.

I have a particular case here. I will not post the entire functions I have but I will rather post some representative functions instead (in order to keep it shorter and more general so that maybe it is going to be useful to someone else in the future).

Here is the structure of my code:

def start_requests(self):
        for i, prod in enumerate(products):
            url = 'https://www.amazon.it/s?' + urlencode({'k': prod['query']})
            competitors = scrapy.Request(url=url, callback=self.parse_keyword_response, meta={'prod':prod})
            yield competitors


def parse_keyword_response(self, response):
        # Function that loops on the results of the query made, 
        # and collects all the products that actually match our search
        products = response.xpath('//*[@data-asin]')
        prod = response.meta['prod']

        competitors =[]

        for product in products:
            asin = product.xpath('@data-asin').extract_first()
            product_url = f"https://www.amazon.it/dp/{asin}"
            competitor = scrapy.Request(url=product_url, callback=self.parse_competitor_product_page, meta={'asin': asin, 'prod':prod})
            yield competitor
            competitors.append(competitor)


def parse_competitor_product_page(self, response):
        # Function that scrapes information from a product page and yields the competitor
        # only if it actually matches our search.

        ' Do some work and scrape required product attributes'

        competitor = ProductItem()
        competitor['product'] = prod_name
        competitor['asin'] = asin
        competitor['Title'] = title
        competitor['producer'] = producer
        competitor['MainImage'] = image
        competitor['Rating'] = rating
        competitor['NumberOfReviews'] = number_of_reviews
        competitor['price'] = price
        competitor['AvailableSizes'] = sizes
        competitor['AvailableColors'] = colors
        competitor['Varieties'] = varieties
        competitor['BulletPoints'] = bullet_points
        competitor['SellerRank'] = seller_rank

        if self.is_right_product(prod, competitor, response):
            yield competitor

def is_right_product(self, product, competitor, response):
       # Function that checks whether a resulting competitor actually matches the product that 
       # we looked for. It returns a boolean if it does. It also alters some attributes of that
       # competitor if a right variation is found on its page.

      ''' I will omit some if/else branches as those work well and I will only post the faulty 
           branch (which happens to be the one that should modify the competitor object because 
           a right variation is found on its page). '''

      if product_is_right_quantity(competitor):
           return True
      else:
           variation = find_variation_of_right_quantity(product['quantity'], competitor)
           if variation is not None:
                competitor = self.update_product_to_right_variation(competitor, variation, response)
                print("variation check done")
                return True
           else:
                return False

def update_product_to_right_variation(self, product, variation_name, response):
        print("IN UPDATE PRODUCT TO RIGHT VARIATION")
        variation_asin = response.xpath(f'//div[@id="variation_color_name"]/ul/li[contains(@title, "{variation_name}")]/@data-defaultasin').get()
        product_url = f"https://www.amazon.it/dp/{variation_asin}"
        print(product_url)
        yield scrapy.Request(url=product_url, callback=self.update_competitor_from_product_page, errback=self.errback_http, meta={'prod':product, 'asin':variation_asin})

def update_competitor_from_product_page(self, response):
        print("INSIDE UPDATE COMPETITOR FROM PRODUCT PAGE")
        prod = response.meta['prod']
        asin = response.meta['asin']

        price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()

        prod['price'] = price
        prod['Title'] = title
        prod['asin'] = asin

        response.meta['prod'] = prod
        print(prod['price'])
        return prod

As you can see I placed some print statements for debugging purposes.

The print statements in update_competitor_from_product_page never get output.

All the others do. So the function that should serve as the callback of the request made in update_product_to_right_variation never gets called. As a consequence, the competitor object remains unchanged.

I am new to async programming and new to Scrapy as well.

First of all, I would like to know why my callback function never gets called. Secondly, how can I do what I have in mind?

Asked By: giulio di zio


Answer 1

I can’t test it, but the problem is probably where the Request is yielded. You yield it in update_product_to_right_variation(), which is called from is_right_product(), which in turn is called from parse_competitor_product_page(). A yield or return inside update_product_to_right_variation() cannot hand the Request directly to the Scrapy engine; it only hands it to the function that called it, is_right_product(). That function would have to pass it back to parse_competitor_product_page(), and only when parse_competitor_product_page() yields it does it reach the Scrapy engine, which then executes it.

In your code, update_product_to_right_variation() yields the Request to is_right_product(), but is_right_product() only returns True or False. The Request is therefore never passed on to parse_competitor_product_page(), so it never reaches the Scrapy engine and its callback is never called.
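
The effect can be reproduced without Scrapy. The following is a minimal sketch in plain Python; FakeRequest, make_request, helper_returning_bool and the two callbacks are hypothetical stand-ins for the spider's methods, not Scrapy API. It shows that a value yielded inside a nested generator reaches only the code that iterates over that generator, and is lost when the calling function just returns a boolean.

```python
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request; it only stores a URL."""
    def __init__(self, url):
        self.url = url


def make_request(url):
    # Because of the `yield`, calling this function only creates a generator.
    # Whatever it yields goes to whoever iterates over that generator.
    yield FakeRequest(url)


def helper_returning_bool(url):
    # Mirrors the faulty branch: the generator is created but never passed on,
    # and the caller only ever sees True.
    pending = make_request(url)
    return True


def broken_callback(url):
    # Yields an item, but the request created in the helper never gets out.
    if helper_returning_bool(url):
        yield {"item": url}


def fixed_callback(url):
    # Re-yield everything the nested generator produces, so the consumer
    # (the engine, in Scrapy's case) actually receives the request.
    yield from make_request(url)


broken_output = list(broken_callback("https://example.com/a"))
fixed_output = list(fixed_callback("https://example.com/a"))

print([type(x).__name__ for x in broken_output])  # ['dict']
print([type(x).__name__ for x in fixed_output])   # ['FakeRequest']
```

In the broken variant the consumer receives only the unchanged item; in the fixed variant the request itself comes out of the callback, which is what a scheduler needs in order to call the request's callback later.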

I think you need something like this:

def parse_competitor_product_page(self, response):
    # Function that scrapes information from a product page and yields the competitor
    # only if it actually matches our search.

    ' Do some work and scrape required product attributes'

    competitor = ProductItem()
    competitor['product'] = prod_name
    competitor['asin'] = asin
    competitor['Title'] = title
    competitor['producer'] = producer
    competitor['MainImage'] = image
    competitor['Rating'] = rating
    competitor['NumberOfReviews'] = number_of_reviews
    competitor['price'] = price
    competitor['AvailableSizes'] = sizes
    competitor['AvailableColors'] = colors
    competitor['Varieties'] = varieties
    competitor['BulletPoints'] = bullet_points
    competitor['SellerRank'] = seller_rank

    variation = self.is_right_product(prod, competitor)
    if variation is True:
        # send to Scrapy's engine: ProductItem without changes
        yield competitor
    elif variation is not None:
        # send to Scrapy's engine: Request for the page with the variation
        yield self.update_product_to_right_variation(competitor, variation, response)
    # if variation is None, no matching variation exists and nothing is yielded


def is_right_product(self, product, competitor):
    # Function that checks whether a resulting competitor actually matches the product that 
    # we looked for. It returns True if the competitor matches as is, the name of a matching
    # variation if one is found on its page, or None otherwise.

    '''I will omit some if else branches as those work well and I will only post the faulty 
       branch (which happens to be the one that should modify the competitor object because 
       a right variation is found on its page). '''

    if product_is_right_quantity(competitor):
        return True  # assigned to `variation` in `parse_competitor_product_page()`
    
    # the variation name or `None` is assigned to `variation` in `parse_competitor_product_page()`
    return find_variation_of_right_quantity(product['quantity'], competitor)


def update_product_to_right_variation(self, competitor, variation_name, response):
    print("IN UPDATE PRODUCT TO RIGHT VARIATION")
    
    variation_asin = response.xpath(f'//div[@id="variation_color_name"]/ul/li[contains(@title, "{variation_name}")]/@data-defaultasin').get()
    
    product_url = f"https://www.amazon.it/dp/{variation_asin}"
    
    print(product_url)
    
    # send back to `parse_competitor_product_page()`
    return scrapy.Request(url=product_url,
                          callback=self.update_competitor_from_product_page,
                          errback=self.errback_http,
                          meta={'prod':competitor, 'asin':variation_asin})


def update_competitor_from_product_page(self, response):
    print("INSIDE UPDATE COMPETITOR FROM PRODUCT PAGE")
    prod = response.meta['prod']
    asin = response.meta['asin']

    price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()
    #title = ...
    
    prod['price'] = price
    prod['Title'] = title
    prod['asin'] = asin

    #response.meta['prod'] = prod # useless
    print(prod['price'])
    
    # send to Scrapy's engine: item with changes
    yield prod
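
The flow above, where the partly filled item travels in meta and is yielded only by the last callback, can be tried out with a toy loop. This is a sketch under stated assumptions: FakeRequest, FakeResponse, run, the example.com URLs and the prices are all made up for illustration, and the loop is not how Scrapy's engine is actually implemented.

```python
from collections import deque


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: a URL, a callback, and meta."""
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


class FakeResponse:
    """Hypothetical stand-in for a response; it exposes the request's meta."""
    def __init__(self, request, price):
        self.url = request.url
        self.meta = request.meta
        self.price = price  # pretend this was scraped from the page


def parse_product(response):
    item = {"asin": "BASE", "price": response.price}
    variation_asin = "VAR"  # assume a matching variation was found
    # Hand the partly filled item to the next callback through meta.
    yield FakeRequest(f"https://example.com/dp/{variation_asin}",
                      callback=update_from_variation,
                      meta={"prod": item, "asin": variation_asin})


def update_from_variation(response):
    item = response.meta["prod"]
    item["asin"] = response.meta["asin"]
    item["price"] = response.price
    yield item  # only now is the item complete


def run(start_request, prices):
    """Toy scheduler loop: follows requests, collects everything else."""
    queue, items = deque([start_request]), []
    while queue:
        request = queue.popleft()
        response = FakeResponse(request, prices[request.url])
        for result in request.callback(response):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


prices = {"https://example.com/dp/BASE": "10", "https://example.com/dp/VAR": "25"}
items = run(FakeRequest("https://example.com/dp/BASE", parse_product), prices)
print(items)  # [{'asin': 'VAR', 'price': '25'}]
```

The first callback never yields the item itself; it yields only the follow-up request, so exactly one item, carrying the variation's asin and price, comes out at the end.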

Answered By: furas
