Scrapy: Why does a basic Request after logging in sign me out?

Question:

I really don't understand why a basic request after logging in signs me out in Scrapy. I have raised this question on several Scrapy forums (question links: Reddit, GitHub, Stack Overflow), but none of them provided an answer. I can do this with Selenium without any issue, but replicating it with Scrapy appears to be a problem. I have tried more than 50 different SO solutions. I just need a reason why I get logged out once I spawn another request after logging in.

Here is the basic Selenium and Scrapy script for that, with dummy account details to sign in with.

from selenium.webdriver import Chrome, ChromeOptions
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By  # needed for By.ID / By.XPATH below


#define our URL
url = 'https://www.oddsportal.com/login/'
username = 'chuky'
password = 'A151515a'
path = r'C:UsersGlodarisOneDriveDesktopRepoScraperchromedriver.exe'
webdriver_service = Service(path)
options = ChromeOptions()


# options=options
browser = Chrome(service=webdriver_service, options=options)

browser.get(url)
browser.implicitly_wait(2)
browser.find_element(By.ID, 'onetrust-accept-btn-handler').click()
browser.find_element(By.ID,'login-username1').send_keys(username)
browser.find_element(By.ID,'login-password1').send_keys(password)
browser.implicitly_wait(10)
browser.find_element(By.XPATH,'//*[@id="col-content"]//button[@class="inline-btn-2"]').click()#.send_keys(self.password)

print('successful login')
browser.implicitly_wait(10)
browser.get('https://www.oddsportal.com/results/')

Scrapy

import logging

import scrapy
from scrapy.http import FormRequest
from scrapy.spiders import CrawlSpider
from scrapy.utils.response import open_in_browser

logger = logging.getLogger(__name__)


class OddsportalSpider(CrawlSpider):
    name = 'oddsportal'
    allowed_domains = ['oddsportal.com']
    # start_urls = ['http://oddsportal.com/results/']
    login_page = 'https://www.oddsportal.com/login/'

    def start_requests(self):
        """Called before crawling starts. Request the login page first."""
        yield scrapy.Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True,
        )

    def login(self, response):
        """Generate a login request from the form on the login page."""
        yield FormRequest.from_response(
            response=response,
            formdata={
                'login-username': 'chuky',
                'login-password': 'A151515a',
                'login-submit': '',
            },
            callback=self.after_login,
            dont_filter=True,
        )

    def after_login(self, response):
        """Check whether the login succeeded, then request /results/."""
        if b"Wrong username or password" in response.body:
            logger.warning("LOGIN ATTEMPT FAILED")
            return
        logger.info("LOGIN ATTEMPT SUCCESSFUL")
        url = 'https://www.oddsportal.com/results/'
        return scrapy.Request(url=url, callback=self.parse_item, dont_filter=True)

    def parse_item(self, response):
        # Print the final URL and open the fetched HTML in a browser for inspection.
        print('Thissssssssss----------------------', response.url)
        open_in_browser(response)

I get signed out once I spawn a request to /results/ after a successful login. I was told that Scrapy handles cookies by default. I have also tried sending cookies and headers alongside every request, but that didn't work. Please, I need someone to try this from their end and tell me the reason for this, because my response shows that I am logged in, but sending a request after that logs me out.

Steps to reproduce the Scrapy response:

  1. scrapy startproject oddsportal
  2. scrapy genspider -t crawl oddsportal oddsportal.com
  3. set the user agent to the default Scrapy project user agent: USER_AGENT = 'oddsportal_website (+http://www.yourdomain.com)'
  4. run spider: scrapy crawl oddsportal

Logs

{'BOT_NAME': 'oddsportal_website',
 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
 'NEWSPIDER_MODULE': 'oddsportal_website.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['oddsportal_website.spiders']}
2022-08-15 09:47:48 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-08-15 09:47:48 [scrapy.extensions.telnet] INFO: Telnet Password: 66aa39ca3b133f3d
2022-08-15 09:47:48 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-08-15 09:47:48 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'oddsportal_website.middlewares.UserAgentRotatorMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-08-15 09:47:48 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'log_count/DEBUG': 9,
 'log_count/INFO': 11,
 'request_depth_max': 2,
 'response_received_count': 4,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2022, 8, 15, 8, 47, 48, 449490)}
Asked By: benjamin olise

Answers:

You are logged in! It just so happens that the username is not part of the response; it gets loaded either via an API call or by JavaScript using cookies (view the page source of the results page and search for chuky; you won't find it). Since Scrapy only loads the response from the URL you request (no JS or other API calls), the username won't show up. A good way to confirm that you are logged in is to request https://www.oddsportal.com/settings/, which has the username in the HTML.
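
The check described above can be sketched as a small helper. This is only an illustration, not code from the answer: the function name username_in_page and the callback name check_settings are hypothetical, and it assumes the settings page really does contain the username in its raw HTML, as stated above.

```python
def username_in_page(body: bytes, username: str, encoding: str = "utf-8") -> bool:
    """Return True if the username appears in the raw (non-JS-rendered) HTML body."""
    text = body.decode(encoding, errors="replace")
    # Case-insensitive match, since the site may capitalize the display name.
    return username.lower() in text.lower()


# Hypothetical use inside the spider from the question, as the callback of a
# scrapy.Request to https://www.oddsportal.com/settings/ issued after login:
#
#     def check_settings(self, response):
#         if username_in_page(response.body, "chuky"):
#             self.logger.info("session is still authenticated")
#         else:
#             self.logger.warning("username not found; session may be lost")
```

If the helper returns True for /settings/ but False for /results/, that is consistent with the explanation above: the session is intact, and the results page fills in the username client-side.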

Answered By: zaki98

I was able to get the right result by using the method described in the post "Scrapy-Splash Session Handling".
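
The code used from that post is not shown here, so the following is only a rough sketch under the assumption that it follows the session-handling pattern from the scrapy-splash README: a Lua script run on the execute endpoint that loads the cookies sent by Scrapy and returns the updated cookies, so that SplashCookiesMiddleware (already enabled in the logs above) can carry the session across requests. The helper name splash_session_kwargs is hypothetical.

```python
# Lua script following the session-handling pattern in the scrapy-splash README
# (an assumption; the linked post may differ in detail).
LUA_SESSION_SCRIPT = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
  })
  assert(splash:wait(0.5))
  local entries = splash:history()
  local last_response = entries[#entries].response
  return {
    url = splash:url(),
    headers = last_response.headers,
    http_status = last_response.status,
    cookies = splash:get_cookies(),
    html = splash:html(),
  }
end
"""


def splash_session_kwargs(url: str) -> dict:
    """Build keyword arguments for scrapy_splash.SplashRequest (hypothetical helper)."""
    return {
        "url": url,
        "endpoint": "execute",          # run the Lua script instead of render.html
        "cache_args": ["lua_source"],   # avoid resending the script on every request
        "args": {"lua_source": LUA_SESSION_SCRIPT},
    }


# Hypothetical use inside a spider callback:
#
#     from scrapy_splash import SplashRequest
#     yield SplashRequest(callback=self.parse_item,
#                         **splash_session_kwargs("https://www.oddsportal.com/results/"))
```

Because the page is rendered by Splash, JavaScript-loaded content such as the username should appear in the returned html, which the plain Scrapy request could not provide.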

Answered By: benjamin olise