Scrapy spider not found error
Question:
This is Windows 7 with python 2.7
I have a scrapy project in a directory called caps (this is where scrapy.cfg is)
My spider is located in caps\caps\spiders\campSpider.py
I cd into the scrapy project and try to run
scrapy crawl campSpider -o items.json -t json
I get an error that the spider can’t be found. The class name is campSpider
...
spider = self.crawler.spiders.create(spname, **opts.spargs)
File "c:\Python27\lib\site-packages\scrapy-0.14.0.2841-py2.7-win32.egg\scrapy\spidermanager.py", line 43, in create
raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: campSpider'
Am I missing some configuration item?
Answers:
Have you set up the SPIDER_MODULES setting?
SPIDER_MODULES
Default: []
A list of modules where Scrapy will look for spiders.
Example:
SPIDER_MODULES = ['mybot.spiders_prod', 'mybot.spiders_dev']
Make sure you have set the “name” property of the spider.
Example:
class campSpider(BaseSpider):
    name = 'campSpider'
Without the name property, the scrapy manager will not be able to find your spider.
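Under the hood, the lookup that fails in the traceback is essentially a dictionary keyed on each spider's name attribute. A minimal illustrative sketch of that behaviour (hypothetical code, not Scrapy's actual spider manager):

```python
# Illustrative sketch of a name-keyed spider lookup, mirroring the
# KeyError in the traceback above; not Scrapy's actual implementation.

class campSpider:
    name = 'campSpider'  # the attribute `scrapy crawl` matches against

REGISTRY = {cls.name: cls for cls in [campSpider]}

def create(spider_name):
    try:
        return REGISTRY[spider_name]()
    except KeyError:
        raise KeyError("Spider not found: %s" % spider_name)
```

Looking up a name that was never registered raises exactly the error from the question; a class whose name attribute does not match the command line is, from the lookup's point of view, not there at all.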
Also make sure that your project is not called scrapy! I made that mistake and renaming it fixed the problem.
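The reason a project named scrapy breaks things is ordinary Python name shadowing: the project directory comes before site-packages on sys.path, so import scrapy finds your project instead of the library. A runnable sketch of that mechanism, using temporary stand-in directories rather than a real install:

```python
# Sketch: Python resolves a package from the first sys.path entry that
# contains it, so a project dir named 'scrapy' shadows the installed
# library. The two 'scrapy' packages here are throwaway stand-ins.
import importlib.util
import os
import sys
import tempfile

def make_pkg(root, name):
    """Create an empty package `name` under `root` and return `root`."""
    os.makedirs(os.path.join(root, name))
    open(os.path.join(root, name, '__init__.py'), 'w').close()
    return root

project = make_pkg(tempfile.mkdtemp(), 'scrapy')  # a project dir named 'scrapy'
library = make_pkg(tempfile.mkdtemp(), 'scrapy')  # stand-in for site-packages

sys.path[:0] = [project, library]   # the project dir is searched first
sys.modules.pop('scrapy', None)     # force a fresh lookup

spec = importlib.util.find_spec('scrapy')
print(spec.origin.startswith(project))  # the project wins the lookup
```

The same precedence applies to the real library: once your project shadows it, every `from scrapy... import ...` inside Scrapy's own tooling resolves against the wrong package.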
Make sure that your spider file is saved in your spiders directory; the crawler looks for spider names in that directory.
You have to give a name to your spider.
However, BaseSpider is deprecated, use Spider instead.
from scrapy.spiders import Spider

class campSpider(Spider):
    name = 'campSpider'
The project should have been created by the startproject command:
scrapy startproject project_name
Which gives you the following directory tree:
project_name/
scrapy.cfg # deploy configuration file
project_name/ # project's Python module, you'll import your code from here
__init__.py
items.py # project items file
pipelines.py # project pipelines file
settings.py # project settings file
spiders/ # a directory where you'll later put your spiders
__init__.py
...
Make sure that settings.py defines your spider module, e.g.:
BOT_NAME = 'bot_name'  # usually the same as your project_name
SPIDER_MODULES = ['project_name.spiders']
NEWSPIDER_MODULE = 'project_name.spiders'
You should then have no problem running your spider locally or on Scrapinghub.
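SPIDER_MODULES works by importing each listed module and scanning it for spider classes. A simplified, runnable sketch of that discovery step (illustrative, not Scrapy's actual code; the module is written to a temp directory so the example is self-contained):

```python
# Sketch of SPIDER_MODULES-style discovery: import each listed module
# and collect classes that carry a `name` attribute.
import importlib
import inspect
import os
import sys
import tempfile
import textwrap

# Write a throwaway spiders module so the example runs anywhere.
pkg_dir = tempfile.mkdtemp()
with open(os.path.join(pkg_dir, 'demo_spiders.py'), 'w') as f:
    f.write(textwrap.dedent('''\
        class CampSpider:
            name = 'campSpider'

        class Helper:
            pass  # no `name`, so discovery skips it
    '''))
sys.path.insert(0, pkg_dir)

SPIDER_MODULES = ['demo_spiders']  # stand-in for the settings.py value

found = {}
for modname in SPIDER_MODULES:
    module = importlib.import_module(modname)
    for _, cls in inspect.getmembers(module, inspect.isclass):
        name = getattr(cls, 'name', None)
        if name:
            found[name] = cls

print(sorted(found))  # ['campSpider']
```

This is why a missing SPIDER_MODULES entry, a wrong project name in the setting, or a spider class without name all surface as the same "Spider not found" error.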
Check indentation too; the class for my spider was indented one extra tab, which makes the class definition invalid.
Try running scrapy list on the command line; if there is any error in a spider, it will surface it.
In my case, I had blindly copied code from another project and forgotten to change the project name in the spider module import.
For anyone with the same problem: besides setting the name of the spider and checking SPIDER_MODULES and NEWSPIDER_MODULE in your Scrapy settings, if you are running a scrapyd service you also need to restart it to apply any change you have made.
The name attribute of a CrawlSpider subclass defines the spider name, and this name is used on the command line to invoke the spider.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class NameSpider(CrawlSpider):
    name = 'name of spider'
    allowed_domains = ['allowed domains of web portal to be scraped']
    start_urls = ['start url of web portal to be scraped']
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    }
    product_css = ['.main-menu']
    rules = [
        Rule(LinkExtractor(restrict_css=product_css), callback='parse'),
    ]

    def parse(self, response):
        # implementation of business logic
        pass
Without a project, use runspider with the file name; within a project, use crawl with the spider name.
Example: C:\user> scrapy runspider myFile.py
In my case, I had set LOG_STDOUT = True, and scrapyd cannot return the results in the JSON response when you query your spiders with /listspiders.json; instead, the results are printed to the log files configured in scrapyd's default_scrapyd.conf. So I changed the setting as follows, and it worked well.
LOG_STDOUT = False
I also had this problem, and it turned out to be rather small. Be sure your class inherits from scrapy.Spider:
class my_class(scrapy.Spider):
Ah yes, you need to set the value of the name variable, i.e.:
import scrapy

class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/'
    ]

    def parse(self, response):
        title = response.css('title').extract()
        yield {'titleText': title}
So in this case, name = 'quotes'. Then on the command line you enter:
scrapy crawl quotes
That was my problem.
If you are following the tutorial from https://docs.scrapy.org/en/latest/intro/tutorial.html
Then do something like:
$ sudo apt install python-pip
$ pip install Scrapy
(logout, login)
$ cd
$ scrapy startproject tutorial
$ vi ~/tutorial/tutorial/spiders/quotes_spider.py
$ cd ~/tutorial/tutorial
$ scrapy crawl quotes
The error happens if you try to create the spiders directory yourself under ~/tutorial instead of letting startproject create it.
Also, it is possible that you have not deployed your spider. So first use scrapyd to start the server, then use scrapyd-deploy to deploy, and then run the command.
Sometimes this strange behaviour is caused by LOG_STDOUT = True. It defaults to False, though, so check it, and if it is set to True, set it back to the default:
LOG_STDOUT = False
This is a logged issue.
I had the same issue. When I used scrapy list in cmd, the command listed the spider name I was getting the error for, but when I tried to run it with scrapy crawl SpiderName.py, I got the "Scrapy spider not found" error. I had used this spider before and everything had been fine with it. So I used the secret weapon: I restarted my system, and the issue was resolved.
Ensure that the same name attribute is used on the command line when running the spider with scrapy crawl.
I fixed it by fixing my filename: originally my.spider.py, fixed to myspider.py. I'm very new to Python and Scrapy, so I'm not sure if this is a dumb mistake on my part.
Just to add my learning point here: I had my crawler working, then it suddenly started giving the error, and I came here to find the solution. I couldn't fix it, so I checked my changes and realized I had stupidly created a new variable called "name". This also causes Scrapy to not find the spider name.
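One plausible reading of that mistake is plain attribute shadowing: if a second assignment to name appears in the class body, the later value silently wins, so scrapy crawl no longer matches the original name. A hypothetical spider demonstrating it:

```python
# Hypothetical spider showing how a second `name` assignment in the
# class body silently replaces the spider name.
class QuotesSpider:
    name = 'quotes'            # the spider name you expect to crawl with
    # ... more attributes ...
    name = 'some_other_value'  # accidental reuse: this assignment wins

print(QuotesSpider.name)  # 'some_other_value' -- `scrapy crawl quotes` fails
```

Python raises no warning for this, which is why the breakage feels sudden and unexplained.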
An improper name for the Python file can also lead to this error (for example crawler.py or scrapy.py).
I solved a problem like this by running the spider from the directory where the scrapy.cfg file is located, NOT from the full path where campSpider.py lives (caps\caps\spiders\campSpider.py). So try it from caps only. From that directory I also suggest running scrapy list, which shows the spiders you have created. I hope it helps someone.
Omitting the file extension for the spider file can also lead to this error: if instead of my-project/spiders/my-spider.py you name your file my-project/spiders/my-spider, you will get exactly this error.
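That failure mode is ordinary Python module discovery: only files ending in .py count as importable modules, so an extension-less file is invisible to the spider-module scan. A small self-contained demonstration with the standard library's pkgutil:

```python
# Demonstration: pkgutil only reports *.py files as modules, so a
# spider file saved without its extension is never discovered.
import os
import pkgutil
import tempfile

spiders_dir = tempfile.mkdtemp()
# One properly named spider file, one missing its extension.
open(os.path.join(spiders_dir, 'my_spider.py'), 'w').close()
open(os.path.join(spiders_dir, 'my_spider_noext'), 'w').close()

modules = sorted(m.name for m in pkgutil.iter_modules([spiders_dir]))
print(modules)  # ['my_spider'] -- the extension-less file is invisible
```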