scrapy allow all subdomains
Question:
I want to use Scrapy to crawl a website whose pages are spread across many subdomains.
I know I need a CrawlSpider with a Rule, but I need the Rule to simply "allow all subdomains and let the parsers handle the rest according to the data" (meaning, in the example below, the item_links point to different subdomains).
example of the code:

def parse_page(self, response):
    sel = Selector(response)
    item_links = sel.xpath("XXXXXXXXX").extract()
    for item_link in item_links:
        item_request = Request(url=item_link,
                               callback=self.parse_item)
        yield item_request

def parse_item(self, response):
    sel = Selector(response)
** EDIT **
Just to make the question clear: I want the ability to crawl all of *.example.com, meaning not to get "Filtered offsite request to 'foo.example.com'".
** ANOTHER EDIT **
Following @agstudy's answer, don't forget to delete allowed_domains = ["www.example.com"]
Answers:
You can set an allow_domains list for the rule:

rules = (
    Rule(SgmlLinkExtractor(allow_domains=('domain1', 'domain2'))),
)
For example:

rules = (
    Rule(SgmlLinkExtractor(allow_domains=('example.com', 'example1.com'))),
)
This will allow URLs like:
www.example.com/blaa/bla/
www.example1.com/blaa/bla/
www.something.example.com/blaa/bla/
If you are not using rules but are relying on the allowed_domains class attribute of the Spider, you can also set allowed_domains = ['example.com']. That will allow all subdomains of example.com, such as foo.example.com.
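The offsite filter essentially performs a host-suffix match against allowed_domains: a host passes if it equals an allowed domain or ends with "." plus that domain. A rough stdlib-only illustration of that matching rule (url_is_allowed is a hypothetical helper written for this sketch, not Scrapy API):

```python
from urllib.parse import urlparse


def url_is_allowed(url, allowed_domains):
    """Illustrative approximation of Scrapy's offsite matching:
    a host is allowed if it equals an allowed domain or is a
    subdomain of one (i.e. ends with '.' + domain)."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["example.com"]
print(url_is_allowed("http://foo.example.com/page", allowed))  # subdomain: allowed
print(url_is_allowed("http://www.example.com/page", allowed))  # subdomain: allowed
print(url_is_allowed("http://notexample.com/page", allowed))   # different host: filtered
```

This is why allowed_domains = ["www.example.com"] filters out foo.example.com ("foo.example.com" does not end with ".www.example.com"), while the bare "example.com" accepts every subdomain.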
To crawl a website with Scrapy and allow all subdomains, you can use a CrawlSpider with a Rule whose LinkExtractor does not restrict domains. Here is an example of how you can do this:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request


class MySpider(CrawlSpider):
    name = "myspider"
    # allowed_domains = ["www.example.com"] would filter out other
    # subdomains -- either delete it or use the bare registered domain:
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/start_page",
    ]

    rules = (
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        sel = Selector(response)
        item_links = sel.xpath("XXXXXXXXX").extract()
        for item_link in item_links:
            item_request = Request(url=item_link, callback=self.parse_item)
            yield item_request

    def parse_item(self, response):
        sel = Selector(response)
        # Parse the item here
        # ...