Unfortunately I don’t have enough reputation to comment, so I have to ask this as a new question, referring to https://stackoverflow.com/questions/23105590/how-to-get-the-pipeline-object-in-scrapy-spider
I have many URLs in a DB, so I want my spider to get its start URLs from that DB. So far, not a big problem.
However, I don’t want the MySQL code inside the spider, and when I move it into the pipeline I run into a problem.
If I try to hand the pipeline object over to my spider, as in the referenced question, I only get an AttributeError with the message

'NoneType' object has no attribute 'getUrl'
I think the actual problem is that the spider_opened function never gets called (I also inserted a print statement there, and its output never showed up in the console).
Does anybody have an idea how to get the pipeline object into the spider?
# in my spider
def __init__(self):
    self.pipe = None  # should be set by the pipeline via the spider_opened signal

def start_requests(self):
    url = self.pipe.getUrl()
    yield scrapy.Request(url, callback=self.parse)
# in my pipeline
@classmethod
def from_crawler(cls, crawler):
    pipeline = cls()
    crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)

def spider_opened(self, spider):
    spider.pipe = self

def getUrl(self):
    ...
Scrapy pipelines already have the expected methods open_spider and close_spider:

open_spider(spider)
    This method is called when the spider is opened.
    Parameters: spider (Spider object) – the spider which was opened

close_spider(spider)
    This method is called when the spider is closed.
    Parameters: spider (Spider object) – the spider which was closed
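For illustration, a minimal pipeline using these two hooks might look like the following sketch (the class name and the item counter are made up for this example):

class CountingPipeline:
    def open_spider(self, spider):
        # called once when the spider is opened
        self.items_seen = 0

    def close_spider(self, spider):
        # called once when the spider is closed
        spider.logger.info('pipeline processed %d items', self.items_seen)

    def process_item(self, item, spider):
        self.items_seen += 1
        return item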
However, your original issue doesn’t make much sense: why do you want to assign a pipeline reference to your spider? That seems like a very bad idea.
What you should do is open up the DB and read the URLs in your spider itself:
from scrapy import Spider


class MySpider(Spider):
    name = 'myspider'
    start_urls = []

    @classmethod
    def from_crawler(self, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.start_urls = self.get_urls_from_db()
        return spider

    def get_urls_from_db(self):
        db = ...  # get db cursor here
        urls = ...  # use cursor to pop your urls
        return urls
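As a concrete example, a hypothetical get_urls_from_db using MySQL via the mysql-connector-python package might look like this (the connection parameters, database name, and table schema are assumptions, not part of the answer):

import mysql.connector

def get_urls_from_db(self):
    # hypothetical: a table urls(url VARCHAR(2048)) in database 'crawler'
    conn = mysql.connector.connect(host='localhost', user='scrapy',
                                   password='secret', database='crawler')
    try:
        cursor = conn.cursor()
        cursor.execute('SELECT url FROM urls')
        return [row[0] for row in cursor.fetchall()]
    finally:
        conn.close()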
I’m using the accepted solution, but it doesn’t work as expected:

TypeError: get_urls_from_db() missing 1 required positional argument: 'self'

(In the accepted answer, from_crawler is a classmethod, so self there is actually the class, and self.get_urls_from_db() calls the instance method without an instance.)
Here’s the one that worked for me:
import os

from scrapy import Spider


class MySpider(Spider):
    name = 'myspider'
    start_urls = []

    def __init__(self, db_dsn):
        self.db_dsn = db_dsn
        self.start_urls = self.get_urls_from_db(db_dsn)

    @classmethod
    def from_crawler(cls, crawler):
        spider = cls(
            db_dsn=os.getenv('DB_DSN', 'mongodb://localhost:27017'),
        )
        spider._set_crawler(crawler)
        return spider

    def get_urls_from_db(self, db_dsn):
        db = ...  # get db cursor here
        urls = ...  # use cursor to pop your urls
        return urls
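Since the DSN above defaults to MongoDB, a hypothetical get_urls_from_db for that backend via pymongo might look like this (the database and collection names are assumptions for the example):

from pymongo import MongoClient

def get_urls_from_db(self, db_dsn):
    # hypothetical: documents shaped like {'url': 'https://...'} in crawler.urls
    client = MongoClient(db_dsn)
    return [doc['url'] for doc in client['crawler']['urls'].find({}, {'url': 1, '_id': 0})]

Note that _set_crawler() is a private Scrapy helper; if you prefer to stay on the public API, returning super().from_crawler(crawler, db_dsn=os.getenv('DB_DSN', 'mongodb://localhost:27017')) should construct the spider and attach the crawler in one step.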