Allow duplicate downloads with Scrapy Image Pipeline?
Question:
Please see below an example version of my code, which uses the Scrapy Image Pipeline to download/scrape images from a site:
import scrapy
from scrapy_splash import SplashRequest
from imageExtract.items import ImageextractItem

class ExtractSpider(scrapy.Spider):
    name = 'extract'
    start_urls = ['url']

    def parse(self, response):
        image = ImageextractItem()
        titles = ['a', 'b', 'c', 'd', 'e', 'f']
        rel = ['url1', 'url2', 'url3', 'url4', 'url5', 'url6']
        image['title'] = titles
        image['image_urls'] = rel
        return image
It all works fine, but under the default settings it avoids downloading duplicates. Is there any way of overriding this so that I can download the duplicates as well? Thanks.
Answers:
I think one possible solution is to create your own image pipeline, inherited from scrapy.pipelines.images.ImagesPipeline, with an overridden get_media_requests method (see the documentation for an example). When yielding the scrapy.Request, pass dont_filter=True to the constructor.
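A minimal sketch of that suggestion, assuming the item stores its URLs in the standard image_urls field (the class name here is illustrative):

import scrapy
from scrapy.pipelines.images import ImagesPipeline

class NoDedupImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # dont_filter=True bypasses the scheduler's duplicate filter; note
        # that the media pipeline keeps its own fingerprint cache, which the
        # follow-up answers below deal with.
        for image_url in item.get('image_urls', []):
            yield scrapy.Request(image_url, dont_filter=True)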
Thanks to Tomáš's instructions, I eventually found a way to download duplicate images.
In _process_request of the MediaPipeline class, I commented out these lines:
# Return cached result if request was already seen
# if fp in info.downloaded:
#     return defer_result(info.downloaded[fp]).addCallbacks(cb, eb)

# Check if request is downloading right now to avoid doing it twice
# if fp in info.downloading:
#     return wad
An uncaught KeyError occurs, but it does not seem to affect my results, so I stopped digging further.
To overcome the KeyError mentioned by Rick, here is what I did:
Look for the function _cache_result_and_execute_waiters, also in the MediaPipeline class; you will see an if case similar to the one shown below:
if isinstance(result, Failure):
    # minimize cached information for failure
    result.cleanFailure()
    result.frames = []
    result.stack = None
I added another if case that checks whether fp is in info.waiting, and moved everything after that inside this case:
if fp in info.waiting:
    info.downloading.remove(fp)
    info.downloaded[fp] = result  # cache result
    for wad in info.waiting.pop(fp):
        defer_result(result).chainDeferred(wad)
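Put together, the patched method looks roughly like this (reconstructed from the snippets above; the exact source may differ between Scrapy versions):

def _cache_result_and_execute_waiters(self, result, fp, info):
    if isinstance(result, Failure):
        # minimize cached information for failure
        result.cleanFailure()
        result.frames = []
        result.stack = None
    # Added guard: only complete the bookkeeping for fingerprints that are
    # actually being waited on, which avoids the KeyError described above.
    if fp in info.waiting:
        info.downloading.remove(fp)
        info.downloaded[fp] = result  # cache result
        for wad in info.waiting.pop(fp):
            defer_result(result).chainDeferred(wad)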
In the debug log, the path name in the "images" field of your Scrapy item is still incorrect, though. But I got the files saved to the correct paths by creating a list of image names for all of my "image_urls".
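That part is not shown above, but one hypothetical way to control the saved paths is to override file_path in your custom pipeline and carry the name in the request meta; the image_names field and the 'image_name' meta key are assumptions for illustration:

import os

import scrapy
from scrapy.pipelines.images import ImagesPipeline

class NamedImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Pair each URL with a name from a hypothetical image_names field.
        for url, name in zip(item['image_urls'], item['image_names']):
            yield scrapy.Request(url, meta={'image_name': name}, dont_filter=True)

    def file_path(self, request, response=None, info=None, *, item=None):
        # Save under the explicit name instead of the default URL-hash name.
        return os.path.join('full', request.meta['image_name'])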
The deduplication comes from these lines in _process_request of MediaPipeline:

# Return cached result if request was already seen
if fp in info.downloaded:
    return defer_result(info.downloaded[fp]).addCallbacks(cb, eb)

# Otherwise, wait for result
wad = Deferred().addCallbacks(cb, eb)
info.waiting[fp].append(wad)
Here, fp is the fingerprint of the request, which is computed as below:
def request_fingerprint(
    request: Request,
    include_headers: Optional[Iterable[Union[bytes, str]]] = None,
    keep_fragments: bool = False,
) -> str:
    """
    Return the request fingerprint as an hexadecimal string.

    The request fingerprint is a hash that uniquely identifies the resource the
    request points to. For example, take the following two urls:

    http://www.example.com/query?id=111&cat=222
    http://www.example.com/query?cat=222&id=111

    Even though those are two different URLs both point to the same resource
    and are equivalent (i.e. they should return the same response).
    ...
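You can check that equivalence directly (request_fingerprint lives in scrapy.utils.request; newer Scrapy versions deprecate it in favour of a fingerprinter class):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

# Same resource, different query-parameter order: identical fingerprints,
# so the pipeline's cache treats the second request as a duplicate.
r1 = Request('http://www.example.com/query?id=111&cat=222')
r2 = Request('http://www.example.com/query?cat=222&id=111')
assert request_fingerprint(r1) == request_fingerprint(r2)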
I think it would be more graceful to add a random parameter to the image URL instead of commenting out source code. Like this:
...
import time

import scrapy
from scrapy.pipelines.images import ImagesPipeline

class YourImagePipelineClass(ImagesPipeline):
    def get_media_requests(self, item, info):
        # A unique query parameter gives every request a distinct fingerprint.
        url = item.get('img_url') + '?<some_params_key>=%s' % str(time.time())
        yield scrapy.Request(url, meta=item, dont_filter=True)
...
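A side effect worth noting: since ImagesPipeline derives the stored file name from a hash of the request URL, the unique query string also gives every copy its own file on disk, which is usually what you want when downloading duplicates on purpose. The trade-off is that the extra parameter is sent to the server, and some sites may reject unexpected query strings.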