Unfortunately I don’t have enough reputation to comment, so I have to ask this as a new question, referring to https://stackoverflow.com/questions/23105590/how-to-get-the-pipeline-object-in-scrapy-spider
I have many URLs in a DB, so I want my spider to get its start URLs from that DB. So far, not a big problem.
However, I don’t want the MySQL code inside the spider, and when I move it into the pipeline I run into a problem.
If I try to hand the pipeline object over to my spider, as in the referenced question, I only get an AttributeError with the message

'NoneType' object has no attribute 'getUrl'
I think the actual problem is that the spider_opened function never gets called (I also inserted a print statement there, and its output never showed up in the console).
Does anybody have an idea how to get the pipeline object into the spider?
# in my spider
def __init__(self):
    self.pipe = None  # should be set by the pipeline via the spider_opened signal

def start_requests(self):
    url = self.pipe.getUrl()
    yield scrapy.Request(url, callback=self.parse)
# in my pipeline
@classmethod
def from_crawler(cls, crawler):
    pipeline = cls()
    crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)

def spider_opened(self, spider):
    spider.pipe = self

def getUrl(self):
    ...
Scrapy pipelines already have the expected methods open_spider and close_spider:

open_spider(spider)
    This method is called when the spider is opened.
    Parameters: spider (Spider object) – the spider which was opened

close_spider(spider)
    This method is called when the spider is closed.
    Parameters: spider (Spider object) – the spider which was closed
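For illustration, a minimal pipeline using these two hooks might look like the following sketch (the class name and the item counter are made up for this example):

class CountingPipeline:
    def open_spider(self, spider):
        # called once when the spider is opened
        self.items_seen = 0

    def close_spider(self, spider):
        # called once when the spider is closed
        spider.logger.info('pipeline processed %d items', self.items_seen)

    def process_item(self, item, spider):
        self.items_seen += 1
        return item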
However, your original issue doesn’t make much sense: why do you want to assign a pipeline reference to your spider? That seems like a very bad idea.
What you should do is open up the DB and read the URLs in your spider itself:
from scrapy import Spider


class MySpider(Spider):
    name = 'myspider'
    start_urls = []

    @classmethod
    def from_crawler(self, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.start_urls = self.get_urls_from_db()
        return spider

    def get_urls_from_db(self):
        db = ...  # get db cursor here
        urls = ...  # use cursor to pop your urls
        return urls
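As a concrete example, a hypothetical get_urls_from_db using MySQL via the mysql-connector-python package might look like this (the connection parameters, database name, and table schema are assumptions, not part of the answer):

import mysql.connector

def get_urls_from_db(self):
    # hypothetical: a table urls(url VARCHAR(2048)) in database 'crawler'
    conn = mysql.connector.connect(host='localhost', user='scrapy',
                                   password='secret', database='crawler')
    try:
        cursor = conn.cursor()
        cursor.execute('SELECT url FROM urls')
        return [row[0] for row in cursor.fetchall()]
    finally:
        conn.close()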
I’m using the accepted solution, but it doesn’t work as expected:

TypeError: get_urls_from_db() missing 1 required positional argument: 'self'

(In the accepted answer, from_crawler is a classmethod, so self there is actually the class, and self.get_urls_from_db() calls the instance method without an instance.)
Here’s the one that worked for me:
import os

from scrapy import Spider


class MySpider(Spider):
    name = 'myspider'
    start_urls = []

    def __init__(self, db_dsn):
        self.db_dsn = db_dsn
        self.start_urls = self.get_urls_from_db(db_dsn)

    @classmethod
    def from_crawler(cls, crawler):
        spider = cls(
            db_dsn=os.getenv('DB_DSN', 'mongodb://localhost:27017'),
        )
        spider._set_crawler(crawler)
        return spider

    def get_urls_from_db(self, db_dsn):
        db = ...  # get db cursor here
        urls = ...  # use cursor to pop your urls
        return urls
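Since the DSN above defaults to MongoDB, a hypothetical get_urls_from_db for that backend via pymongo might look like this (the database and collection names are assumptions for the example):

from pymongo import MongoClient

def get_urls_from_db(self, db_dsn):
    # hypothetical: documents shaped like {'url': 'https://...'} in crawler.urls
    client = MongoClient(db_dsn)
    return [doc['url'] for doc in client['crawler']['urls'].find({}, {'url': 1, '_id': 0})]

Note that _set_crawler() is a private Scrapy helper; if you prefer to stay on the public API, returning super().from_crawler(crawler, db_dsn=os.getenv('DB_DSN', 'mongodb://localhost:27017')) should construct the spider and attach the crawler in one step.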