scrapy.spidermiddlewares.offsite DEBUG: Filtered offsite request to the website I want to scrape. Why don't I get to parse method?

Question

My goal is to print something from the parse method for each request yielded by the for loop in the get_membership_no method.

I am using Python 3.8.5 and Scrapy 1.7.3. When I run the code below, I get "Filtered offsite request".
Here is the console output.

And here is my code.

import scrapy
import json
class BasisMembersSpider(scrapy.Spider):
    name = 'basis'
    allowed_domains = ['www.basis.org.bd']

    def start_requests(self):

        yield scrapy.Request(url="https://basis.org.bd/get-member-list?page=1&team=", callback=self.get_membership_no)


    def get_membership_no(self, response):

        data_array = json.loads(response.body)['data']

        for data in data_array:

            yield scrapy.Request(url='https://basis.org.bd/get-company-profile/{0}'.format(data['membership_no']), callback=self.parse)


    def parse(self, response):
        print("I want to get this line on console. thank you.")

Asked By: Kamrul Hasan

Answer 1

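The reason for this behavior is that you set allowed_domains = ['www.basis.org.bd'], while your requests go to basis.org.bd. Scrapy's offsite middleware allows each listed domain and its subdomains, but not its parent domain, so requests to basis.org.bd are filtered.
You can either leave allowed_domains out completely or extend your list of allowed domains like this: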

allowed_domains = ['www.basis.org.bd', 'basis.org.bd']

See the Scrapy documentation for allowed_domains for more information.
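
To see why the original setting filters these requests, here is a minimal sketch of the host check. It is a simplified re-implementation for illustration, not Scrapy's actual code; the helper name is_allowed and the membership number 123 in the URL are assumptions made up for this example.

```python
import re
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or one of its subdomains.

    Hypothetical helper that mimics the offsite rule; assumes a non-empty list.
    """
    host = urlparse(url).hostname or ''
    # An optional "anything." prefix permits subdomains, never parent domains.
    pattern = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains)
    return re.match(pattern, host) is not None


# Hypothetical profile URL; 123 stands in for a real membership number.
url = 'https://basis.org.bd/get-company-profile/123'

print(is_allowed(url, ['www.basis.org.bd']))  # False: basis.org.bd is the parent of www.basis.org.bd
print(is_allowed(url, ['basis.org.bd']))      # True: exact match
print(is_allowed('https://www.basis.org.bd/', ['basis.org.bd']))  # True: subdomain of an allowed domain
```

Under this rule, listing 'basis.org.bd' alone would cover both basis.org.bd and www.basis.org.bd.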

Answered By: Patrick Klein

Answer 2

Removing "www." from allowed_domains worked for me. Thank you.

This article is really helpful

Answered By: HarsH
