Scrapy – only returns yield from 1 url of list
Question:
I’m crawling a website which has versions for many countries, e.g. amazon.com, .mx, .fr, .de, .es, …
(the website is not actually amazon)
I’ve made a list of the base urls and call parse on each one:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse)
I also have a list of keywords that it will search for, e.g. toshiba, apple, hp, ibm, …
In the parse function I loop through the keywords and build a search url for each one,
e.g. amazon.de/search?={keyword}, in the correct format; this in turn calls another callback function:
def parse(self, response):
    for keyword in self.keywords:
        ...(make build url with keyword)...
        yield scrapy.Request(build_url, meta={'current_page': 1}, callback=self.crawl)
The function crawl will get the href of each listing on the page and follow it with a callback:
def crawl(self, response):
    ...(finds the href of each listing)...
    for listing in listings:
        yield scrapy.Request(self.main_url + href, callback=self.followListing, meta={'data': data})
data is a scrapy item which I fill in followListing() with the fields I’m interested in,
e.g. name, description, price, image_urls, etc. This final callback ends with a yield data,
which I would like saved to an output file.
def followListing(self, response):
    ...(fills up data item)...
    yield data
When I run my crawler:
scrapy crawl my_crawler -o output.json
The output.json file only contains the listings from one of the urls (e.g. amazon.mx); each time I run it, it can contain the listings of a different url, but only ever from one of them.
I suppose that as soon as one finishes, it saves to the output file, so the first one to finish is the only one saved. Is this the case? How can I get the output yielded from all of them, or is there something else I’m missing?
Answers:
There are a number of issues with your spider, including the way you construct the urls and send the initial requests. There are also quite a few areas with a lot of room for improvement, not least the fact that you are using Beautiful Soup instead of the built-in scrapy selectors.
- The Data class is not a proper scrapy Item class, and the way you use it in your spider makes it largely irrelevant anyway, since you end up yielding all the information as a dictionary. Either change it as in the example below or remove it entirely.
class Data(scrapy.Item):
    url: str = scrapy.Field()
    description: str = scrapy.Field()
    user: str = scrapy.Field()
    images: list = scrapy.Field()
- Probably the main issue is in how you are generating the initial requests. In your example you have a list of base urls and yield a request for each of them, then in your parse callback you construct the path and parameters from other static attributes. Since you can easily construct the full url without first dispatching a request to the base address, all of those initial requests are useless. Instead, convert the parse method into a regular method that constructs each of the full urls, and feed those back to the start_requests method to yield as the initial requests.
class Example(scrapy.Spider):
    name = 'example'
    search = '/search?category=&keyword='
    keywords = ['terrains', 'maison', 'house', 'land']
    main_url = ''
    search_term = ''

    def gen_requests(self, url):
        for keyword in self.keywords:
            # join multi-word keywords with '+' for the query string
            build_url = url + self.search + '+'.join(keyword.split(' '))
            yield scrapy.Request(build_url, callback=self.parse)

    def start_requests(self):
        urls = ['https://ci.coinafrique.com', 'https://sn.coinafrique.com',
                'https://tg.coinafrique.com', 'https://bj.coinafrique.com']
        for url in urls:
            yield from self.gen_requests(url)
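Incidentally, the manual '+' joining in gen_requests can be handed off to the standard library. A minimal sketch (the base url and keyword values here are just examples) using urllib.parse.quote_plus, which turns spaces into '+' and also percent-encodes characters that are unsafe in a query string:

```python
from urllib.parse import quote_plus

SEARCH = '/search?category=&keyword='

def build_search_url(base, keyword):
    # quote_plus replaces spaces with '+' and escapes unsafe characters
    return base + SEARCH + quote_plus(keyword)

print(build_search_url('https://ci.coinafrique.com', 'maison'))
# https://ci.coinafrique.com/search?category=&keyword=maison
print(build_search_url('https://ci.coinafrique.com', 'two words'))
# https://ci.coinafrique.com/search?category=&keyword=two+words
```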
- Beautiful Soup is slow, especially with the 'html.parser' backend, and it is also much more verbose. I suggest using the built-in scrapy xpath and css selectors to parse the html and extract the information.
def parse(self, response):
    for listing in response.css('div.col.s6.m4'):
        href = listing.xpath('.//p[@class="ad__card-description"]/a/@href').get()
        yield scrapy.Request(response.urljoin(href), callback=self.followListing)

def followListing(self, response):
    description = response.xpath('//div[@class="ad__info__box ad__info__box-descriptions"]//text()').getall()[1]
    profile = response.css('div.profile-card__content')
    user = profile.xpath('.//p[@class="username"]//text()').get()
    # escape the parentheses so they match the literal url(...) wrapper
    images = response.xpath('//div[contains(@class,"slide-clickable")]/@style').re(r'url\((.*?)\)')
    yield Data(
        url=response.url,
        description=description,
        user=user,
        images=images,
    )
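One caveat on the image extraction: in a regular expression, unescaped parentheses create capture groups rather than matching literal parentheses, so the pattern passed to .re() needs escaped parentheses to pick the address out of the literal url(...) wrapper. A quick stdlib check of that pattern against a made-up style attribute value:

```python
import re

# a hypothetical style attribute, like the ones on the slide-clickable divs
style = 'background-image: url(https://example.com/photo1.jpg);'

# escaped parentheses match the literal url(...); the group captures the address
urls = re.findall(r'url\((.*?)\)', style)
print(urls)  # ['https://example.com/photo1.jpg']
```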
Altogether, this example does not reproduce the issue you described in your question: it successfully crawls the results from each of the initial base urls.
import scrapy


class Data(scrapy.Item):
    url: str = scrapy.Field()
    description: str = scrapy.Field()
    user: str = scrapy.Field()
    images: list = scrapy.Field()


class Example(scrapy.Spider):
    name = 'example'
    search = '/search?category=&keyword='
    keywords = ['terrains', 'maison', 'house', 'land']
    main_url = ''
    search_term = ''

    def gen_requests(self, url):
        for keyword in self.keywords:
            # join multi-word keywords with '+' for the query string
            build_url = url + self.search + '+'.join(keyword.split(' '))
            yield scrapy.Request(build_url, callback=self.parse)

    def start_requests(self):
        urls = ['https://ci.coinafrique.com', 'https://sn.coinafrique.com',
                'https://tg.coinafrique.com', 'https://bj.coinafrique.com']
        for url in urls:
            yield from self.gen_requests(url)

    def parse(self, response):
        for listing in response.css('div.col.s6.m4'):
            href = listing.xpath('.//p[@class="ad__card-description"]/a/@href').get()
            yield scrapy.Request(response.urljoin(href), callback=self.followListing)

    def followListing(self, response):
        description = response.xpath('//div[@class="ad__info__box ad__info__box-descriptions"]//text()').getall()[1]
        profile = response.css('div.profile-card__content')
        user = profile.xpath('.//p[@class="username"]//text()').get()
        # escape the parentheses so they match the literal url(...) wrapper
        images = response.xpath('//div[contains(@class,"slide-clickable")]/@style').re(r'url\((.*?)\)')
        yield Data(
            url=response.url,
            description=description,
            user=user,
            images=images,
        )