Same file downloads

Question

I have a problem with my script such that the same file name, and pdf is downloading. I have checked the output of my results without downloadfile and I get unique data. It’s when I use the pipeline that it somehow produces duplicates for download.

Here’s my script:

import scrapy
from environment.items import fcpItem

class fscSpider(scrapy.Spider):
    name = 'fsc'
    start_urls = ['https://fsc.org/en/members']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url, 
                callback = self.parse
            )
    
    def parse(self, response):
        content = response.xpath("(//div[@class='content__wrapper field field--name-field-content field--type-entity-reference-revisions field--label-hidden field__items']/div[@class='content__item even field__item'])[position() >1]")
        loader = fcpItem()
        names_add = response.xpath(".//div[@class = 'field__item resource-item']/article//span[@class='media-caption file-caption']/text()").getall()
        url = response.xpath(".//div[@class = 'field__item resource-item']/article/div[@class='actions']/a//@href").getall()
        
        pdf=[response.urljoin(x) for x in  url if '#' is not x]
        names = [x.split(' ')[0] for x in names_add]
        for nm, pd in zip(names, pdf):
            loader['names'] = nm
            loader['pdfs'] = [pd]
            yield loader

items.py

class fcpItem(scrapy.Item):
    names = Field()
    pdfs = Field()
    results = Field()

pipelines.py


class DownfilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, item=None):
        items = item['names']+'.pdf'
        return items

settings.py

from pathlib import Path
import os

BASE_DIR = Path(__file__).resolve().parent.parent
FILES_STORE = os.path.join(BASE_DIR, 'fsc')

ROBOTSTXT_OBEY = False

FILES_URLS_FIELD = 'pdfs'
FILES_RESULT_FIELD = 'results'

ITEM_PIPELINES = {

    'environment.pipelines.pipelines.DownfilesPipeline': 150
}

Asked By: Dollar Tune-bill

||

Source

Answer 1

The problem is that you are overwriting the same scrapy item every iteration.

What you need to do is create a new item for each time your parse method yields. I have tested this and confirmed that it does produce the results you desire.

I made and inline not in my example below on the line that needs to be changed.

For example:

import scrapy
from environment.items import fcpItem

class fscSpider(scrapy.Spider):
    name = 'fsc'
    start_urls = ['https://fsc.org/en/members']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url, 
                callback = self.parse
            )
    
    def parse(self, response):
        content = response.xpath("(//div[@class='content__wrapper field field--name-field-content field--type-entity-reference-revisions field--label-hidden field__items']/div[@class='content__item even field__item'])[position() >1]")
        names_add = response.xpath(".//div[@class = 'field__item resource-item']/article//span[@class='media-caption file-caption']/text()").getall()
        url = response.xpath(".//div[@class = 'field__item resource-item']/article/div[@class='actions']/a//@href").getall()
        pdf=[response.urljoin(x) for x in  url if '#' is not x]
        names = [x.split(' ')[0] for x in names_add]
        for nm, pd in zip(names, pdf):
            loader = fcpItem()  # Here you create a new item each iteration
            loader['names'] = nm
            loader['pdfs'] = [pd]
            yield loader

Answered By: Alexander

Answer 2

I am using css instead of xpath.

From the chrome debug panel, the tag is root of item of PDF list.
Under that div tag has title of PDF and tag for file download URL
Between root tag and tag two child’s and sibling relation so xpath is not clean method and hard, a css much better is can easley pick up from root to . it don’t necessary relation ship path. css can skip relationship and just sub/or grand sub is not matter. It also provides not necessary to consider index problem which is URL array and title array sync by index match.

Other key point are URL path decoding and file_urls needs to set array type even if single item.

fsc_spider.py

import scrapy
import urllib.parse
from quotes.items import fcpItem

class fscSpider(scrapy.Spider):
    name = 'fsc'
    start_urls = [
        'https://fsc.org/en/members',
    ]

    def parse(self, response):
        for book in response.css('div.field__item.resource-item'):
            url = urllib.parse.unquote(book.css('div.actions a::attr(href)').get(), encoding='utf-8', errors='replace')
            url_left = url[0:url.rfind('/')]+'/'
            title = book.css('span.media-caption.file-caption::text').get()

            item = fcpItem()
            item['original_file_name'] = title.replace(' ','_')
            item['file_urls'] = ['https://fsc.org'+url_left+title.replace(' ','%20')]
            yield item

items.py

import scrapy

class fcpItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field
    original_file_name = scrapy.Field()

pipelines.py

import scrapy
from scrapy.pipelines.files import FilesPipeline

class fscPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        file_name: str = request.url.split("/")[-1].replace('%20','_')
        return file_name

settings.py

BOT_NAME = 'quotes'

FILES_STORE =  'downloads'
SPIDER_MODULES = ['quotes.spiders']
NEWSPIDER_MODULE = 'quotes.spiders'
FEED_EXPORT_ENCODING = 'utf-8'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = { 'quotes.pipelines.fscPipeline': 1}

file structure

execution

quotes>scrapy crawl fsc

result

Answered By: Bench Vue

Same file downloads

Question:

Answers: