Scrapy requests – Callback function not being called in nested requests
Question:
I am trying to scrape some products from Amazon in order to get some information on my competitors. This is the process I am adopting:
Make a query in the search bar ->
Visit every product page in the results of the query ->
Gather information from that product ->
Check whether the product matches the quantity we are looking for (i.e. we might want to collect only products sold in a pack of n items, such as a kit of n toner cartridges)
-> If it does, yield the item.
-> If not, find a variation in that listing that represents a pack of n items
-> If such a variation exists, visit that variation of the product, modify some information of the item (such as price and ASIN) and yield that item.
I have a particular case here. Rather than posting my entire functions, I will post some representative ones (to keep it shorter and more general, so that it may be useful to someone else in the future).
Here is the structure of my code:
import scrapy
from urllib.parse import urlencode

def start_requests(self):
    for i, prod in enumerate(products):
        url = 'https://www.amazon.it/s?' + urlencode({'k': prod['query']})
        competitors = scrapy.Request(url=url, callback=self.parse_keyword_response, meta={'prod': prod})
        yield competitors

def parse_keyword_response(self, response):
    # Function that loops on the results of the query made,
    # and collects all the products that actually match our search
    products = response.xpath('//*[@data-asin]')
    prod = response.meta['prod']
    competitors = []
    for product in products:
        asin = product.xpath('@data-asin').extract_first()
        product_url = f"https://www.amazon.it/dp/{asin}"
        competitor = scrapy.Request(url=product_url, callback=self.parse_competitor_product_page, meta={'asin': asin, 'prod': prod})
        yield competitor
        competitors.append(competitor)

def parse_competitor_product_page(self, response):
    # Function that scrapes information from a product page and yields the competitor
    # only if it actually matches our search.

    # ... do some work and scrape the required product attributes ...

    competitor = ProductItem()
    competitor['product'] = prod_name
    competitor['asin'] = asin
    competitor['Title'] = title
    competitor['producer'] = producer
    competitor['MainImage'] = image
    competitor['Rating'] = rating
    competitor['NumberOfReviews'] = number_of_reviews
    competitor['price'] = price
    competitor['AvailableSizes'] = sizes
    competitor['AvailableColors'] = colors
    competitor['Varieties'] = varieties
    competitor['BulletPoints'] = bullet_points
    competitor['SellerRank'] = seller_rank

    if self.is_right_product(prod, competitor, response):
        yield competitor

def is_right_product(self, product, competitor, response):
    # Function that checks whether a resulting competitor actually matches the product that
    # we looked for. It returns True if it does. It also alters some attributes of that
    # competitor if a right variation is found on its page.

    # I will omit some if/else branches as those work well and only post the faulty
    # branch (which happens to be the one that should modify the competitor object because
    # a right variation is found on its page).

    if product_is_right_quantity(competitor):
        return True
    else:
        variation = find_variation_of_right_quantity(product['quantity'], competitor)
        if variation is not None:
            competitor = self.update_product_to_right_variation(competitor, variation, response)
            print("variation check done")
            return True
        else:
            return False

def update_product_to_right_variation(self, product, variation_name, response):
    print("IN UPDATE PRODUCT TO RIGHT VARIATION")
    # double quotes around the variation name keep the f-string itself valid
    variation_asin = response.xpath(f'//div[@id="variation_color_name"]/ul/li[contains(@title, "{variation_name}")]/@data-defaultasin').get()
    product_url = f"https://www.amazon.it/dp/{variation_asin}"
    print(product_url)
    yield scrapy.Request(url=product_url, callback=self.update_competitor_from_product_page, errback=self.errback_http, meta={'prod': product, 'asin': variation_asin})

def update_competitor_from_product_page(self, response):
    print("INSIDE UPDATE COMPETITOR FROM PRODUCT PAGE")
    prod = response.meta['prod']
    asin = response.meta['asin']
    price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()
    prod['price'] = price
    prod['Title'] = title
    prod['asin'] = asin
    response.meta['prod'] = prod
    print(prod['price'])
    return prod
As you can see I placed some print statements for debugging purposes.
The print statements in update_competitor_from_product_page never get output.
All the others do. So that function that should be used as a callback function of the request made in update_product_to_right_variation never gets called. As a consequence, the competitor object remains unchanged.
I am new to async programming and new to Scrapy as well.
First of all, I would like to know why my callback function never gets called. Secondly, how can I do what I have in mind?
Answers:
I can’t test it, but the problem may be that you yield the Request in update_product_to_right_variation(), which is executed in is_right_product(), which in turn is executed in parse_competitor_product_page(). A yield/return in update_product_to_right_variation() can’t send the Request directly to the Scrapy engine; it only hands it to the calling function, is_right_product(), which should yield/return it to its own caller, parse_competitor_product_page(). Only when parse_competitor_product_page() yields it does the Request reach the Scrapy engine, which will then execute it.
In your code the Request goes from update_product_to_right_variation() to is_right_product(), but is_right_product() only does return True / return False, so the Request never gets back to parse_competitor_product_page() and therefore never reaches the Scrapy engine.
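The mechanism can be reproduced without Scrapy. Below is a minimal sketch in plain Python, where strings stand in for requests and items, `run_callback` is a hypothetical stand-in for the engine (not a Scrapy API), and `B000TEST` is a made-up ASIN. Calling a function that contains `yield` only creates a generator object, so whatever it would have yielded is lost unless the caller iterates over it and passes the values on.

```python
def make_request(asin):
    # A generator function: calling it runs none of its body until it is iterated.
    yield f"REQUEST /dp/{asin}"


def check_product(asin):
    # Mirrors is_right_product() in the question: the generator is created,
    # dropped, and only a boolean travels back to the caller.
    make_request(asin)
    return True


def broken_callback(asin):
    if check_product(asin):
        yield f"ITEM {asin}"  # only the unchanged item is handed over


def fixed_callback(asin):
    # Iterate over the helper and re-yield its values so they reach the caller.
    for request in make_request(asin):
        yield request


def run_callback(callback, asin):
    # Hypothetical stand-in for the engine: it sees only what the callback yields.
    return list(callback(asin))


broken = run_callback(broken_callback, "B000TEST")
fixed = run_callback(fixed_callback, "B000TEST")
print(broken)  # the request is missing
print(fixed)   # the request is present
```

In the broken variant the request never appears in the output, which corresponds to the callback in the question never being scheduled.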
I think you need something like this:
def parse_competitor_product_page(self, response):
    # Function that scrapes information from a product page and yields the competitor
    # only if it actually matches our search.

    # ... do some work and scrape the required product attributes ...

    competitor = ProductItem()
    competitor['product'] = prod_name
    competitor['asin'] = asin
    competitor['Title'] = title
    competitor['producer'] = producer
    competitor['MainImage'] = image
    competitor['Rating'] = rating
    competitor['NumberOfReviews'] = number_of_reviews
    competitor['price'] = price
    competitor['AvailableSizes'] = sizes
    competitor['AvailableColors'] = colors
    competitor['Varieties'] = varieties
    competitor['BulletPoints'] = bullet_points
    competitor['SellerRank'] = seller_rank

    variation = self.is_right_product(prod, competitor)

    if variation is True:
        # send to Scrapy's engine: ProductItem without changes
        yield competitor
    elif variation is not None:
        # send to Scrapy's engine: Request to the page with the variation
        yield self.update_product_to_right_variation(competitor, variation, response)
    # if variation is None, no matching variation exists and nothing is yielded,
    # as in the `return False` branch of the question

def is_right_product(self, product, competitor):
    # Function that checks whether a resulting competitor actually matches the product that
    # we looked for. It returns True if it does; otherwise it returns the matching
    # variation found on its page, or None if there is none.
    # (Other if/else branches are omitted, as in the question.)

    if product_is_right_quantity(competitor):
        return True  # assigns `True` to `variation` in `parse_competitor_product_page()`

    # assigns the variation or `None` to `variation` in `parse_competitor_product_page()`
    return find_variation_of_right_quantity(product['quantity'], competitor)

def update_product_to_right_variation(self, competitor, variation_name, response):
    print("IN UPDATE PRODUCT TO RIGHT VARIATION")
    # `response` is passed in explicitly because the variation ASIN is read from the current page
    variation_asin = response.xpath(f'//div[@id="variation_color_name"]/ul/li[contains(@title, "{variation_name}")]/@data-defaultasin').get()
    product_url = f"https://www.amazon.it/dp/{variation_asin}"
    print(product_url)

    # send back to `parse_competitor_product_page()`
    return scrapy.Request(url=product_url,
                          callback=self.update_competitor_from_product_page,
                          errback=self.errback_http,
                          meta={'prod': competitor, 'asin': variation_asin})

def update_competitor_from_product_page(self, response):
    print("INSIDE UPDATE COMPETITOR FROM PRODUCT PAGE")

    prod = response.meta['prod']
    asin = response.meta['asin']

    price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()
    # title = ...

    prod['price'] = price
    prod['Title'] = title
    prod['asin'] = asin

    # response.meta['prod'] = prod  # useless

    print(prod['price'])

    # send to Scrapy's engine: item with changes
    yield prod
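If the helper is kept as a generator (as in the question) rather than changed to return the Request, another option is to delegate to it with `yield from`, which forwards every value the helper yields. This is a sketch in plain Python, not tested against Scrapy: strings stand in for requests and items, and the names `follow_variation`, `parse_product` and the ASIN `B000TEST` are hypothetical. Since spider callbacks are ordinary Python generators, the same pattern should carry over.

```python
def follow_variation(item, variation_asin):
    # Helper generator: produces the follow-up request for the variation page.
    yield f"REQUEST /dp/{variation_asin} (carrying {item})"


def parse_product(item, matches, variation_asin=None):
    if matches:
        # Right quantity already: emit the item as-is.
        yield item
    elif variation_asin is not None:
        # Forward everything the helper yields to whoever iterates this callback.
        yield from follow_variation(item, variation_asin)
    # Otherwise nothing is yielded and the product is skipped.


print(list(parse_product("item-A", matches=True)))
print(list(parse_product("item-B", matches=False, variation_asin="B000TEST")))
print(list(parse_product("item-C", matches=False)))
```

The three calls cover the three branches of the process described in the question: yield the item, follow the variation, or skip the product.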