scrapy allow all subdomains

Question:

I want to use Scrapy to crawl a website whose pages are divided across many subdomains.
I know I need a CrawlSpider with a Rule, but I need the Rule to simply "allow all subdomains and let the parsers handle the rest according to the data" (meaning, in the example below, the item_links point to different subdomains).

Example code:

from scrapy.selector import Selector
from scrapy.http import Request

def parse_page(self, response):
    sel = Selector(response)
    item_links = sel.xpath("XXXXXXXXX").extract()  # placeholder XPath
    for item_link in item_links:
        item_request = Request(url=item_link,
                               callback=self.parse_item)
        yield item_request

def parse_item(self, response):
    sel = Selector(response)

** EDIT **
Just to make the question clear: I want the ability to crawl all of *.example.com, meaning I don't want to get Filtered offsite request to 'foo.example.com'.

** ANOTHER EDIT **
Following @agstudy's answer: don't forget to also delete allowed_domains = ["www.example.com"] from the spider.
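
In code terms, the takeaway from both edits looks roughly like this (a sketch, assuming the target site is example.com):

# This is what triggers "Filtered offsite request to 'foo.example.com'":
allowed_domains = ["www.example.com"]

# Either delete the attribute entirely, or widen it to the bare domain,
# which also matches every subdomain (foo.example.com, bar.example.com, ...):
allowed_domains = ["example.com"]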

Asked By: Boaz


Answers:

You can set an allow_domains list on the rule's link extractor:

rules = (
    Rule(SgmlLinkExtractor(allow_domains=('domain1', 'domain2'))),
)

For example:

rules = (
    Rule(SgmlLinkExtractor(allow_domains=('example.com', 'example1.com'))),
)

This will allow URLs like:

www.example.com/blaa/bla/
www.example1.com/blaa/bla/
www.something.example.com/blaa/bla/
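
For context, here is a minimal sketch of that rule inside a full spider, assuming the older scrapy.contrib API that SgmlLinkExtractor belongs to; the spider name and start URL are placeholders:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleSpider(CrawlSpider):
    name = "example"
    start_urls = ["http://www.example.com/"]
    rules = (
        # allow_domains matches the listed domains and all of their
        # subdomains, so www.something.example.com passes as well.
        Rule(SgmlLinkExtractor(allow_domains=('example.com', 'example1.com')),
             follow=True),
    )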
Answered By: agstudy

If you are not using rules but are relying on the spider's allowed_domains class attribute, you can set allowed_domains = ['example.com']. That allows all subdomains of example.com, such as foo.example.com.
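
A minimal sketch of that approach, assuming a plain Spider; names and URLs are placeholders:

from scrapy import Spider, Request

class SubdomainSpider(Spider):
    name = "subdomains"
    # The bare registered domain also covers foo.example.com and friends;
    # only requests to other sites get filtered as offsite.
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # This request is NOT dropped by the offsite middleware:
        yield Request("http://foo.example.com/", callback=self.parse)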

Answered By: bartaelterman

To crawl a website with Scrapy and follow links into all of its subdomains, you can use a CrawlSpider with a Rule whose LinkExtractor has no allow_domains restriction. Here is an example of how you can do this:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request

class MySpider(CrawlSpider):
    name = "myspider"
    # The bare domain also matches every subdomain (foo.example.com, ...).
    # Delete this line entirely if the crawl should leave example.com.
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/start_page",
    ]
    rules = (
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        sel = Selector(response)
        item_links = sel.xpath("XXXXXXXXX").extract()  # placeholder XPath
        for item_link in item_links:
            yield Request(url=item_link, callback=self.parse_item)

    def parse_item(self, response):
        sel = Selector(response)
        # Parse the item here
        # ...
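
Once the spider is saved inside a Scrapy project, it can be run as usual with scrapy crawl myspider.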
Answered By: Obad