Scrapy Playwright Page Method: Prevent timeout error if selector cannot be located

Question:

My question is about Scrapy Playwright: how can I prevent a Spider's Page from crashing when the selector used by a PageMethod cannot be located?

Below is a Scrapy Spider that uses Playwright to interact with the website.
The spider waits for the cookie consent button to appear and then clicks it.
The selector and the actions are defined in the meta attribute of the Request object, as a list of PageMethod objects under the playwright_page_methods key.
If the GDPR button is not present, the Page crashes with a timeout error:
playwright._impl._errors.TimeoutError: Timeout 30000ms exceeded.

from typing import Iterable
import scrapy
from scrapy_playwright.page import PageMethod

GDPR_BUTTON_SELECTOR = "iframe[id^='sp_message_iframe'] >> internal_control=enter-frame >> .sp_choice_type_11"


class GuardianSpider(scrapy.Spider):
    name = "guardian"
    allowed_domains = ["www.theguardian.com"]
    start_urls = ["https://www.theguardian.com"]

    def start_requests(self) -> Iterable[scrapy.Request]:
        url = "https://www.theguardian.com"
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod("wait_for_selector", GDPR_BUTTON_SELECTOR),
                    PageMethod("dispatch_event", GDPR_BUTTON_SELECTOR, "click"),
                ],
            ),
        )

    def parse(self, response):
        pass

If you run the spider and the cookie button is present, everything works fine.
However, if the cookie button is not present, the spider crashes with a timeout error.

This is not how I would like to handle the GDPR button. I would like a function that checks whether the button is present and, if so, clicks it.
Below is a function in plain (synchronous) Python Playwright that does exactly that: it accepts a Page object and checks whether the GDPR button is present. If it is, it clicks it; if not, it does nothing.

from playwright.sync_api import Page

def accecpt_gdpr(page: Page) -> None:
    if page.locator(GDPR_BUTTON_SELECTOR).count():
        page.locator(GDPR_BUTTON_SELECTOR).dispatch_event("click")
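The same idea can also be expressed as a bounded wait that treats a timeout as "button not present" rather than as an error. Here is a minimal, Playwright-free sketch of that pattern using plain asyncio (with Playwright, the analogue would be catching its TimeoutError around a wait_for_selector call with a short timeout):

```python
import asyncio


async def wait_for_button() -> str:
    # Stands in for waiting on a selector that never appears.
    await asyncio.sleep(10)
    return "button"


async def accept_if_present(timeout: float) -> bool:
    # Bound the wait and treat a timeout as "not present" instead of crashing.
    try:
        await asyncio.wait_for(wait_for_button(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        return False


result = asyncio.run(accept_if_present(0.05))
print(result)  # False: the button never appeared, but nothing crashed
```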

How can I achieve the same functionality inside the Scrapy Spider?

Asked By: muw


Answers:

Try this:

gdpr_button = page.query_selector(GDPR_BUTTON_SELECTOR)

if gdpr_button:
    page.locator(GDPR_BUTTON_SELECTOR).dispatch_event("click")
else:
    pass  # do something else
Answered By: Zahidul Islam

I figured out how to achieve this. The question is, in principle, answered in the scrapy-playwright documentation.

from typing import Iterable
import scrapy
from playwright.async_api import Page

GDPR_BUTTON_SELECTOR = "iframe[id^='sp_message_iframe'] >> internal_control=enter-frame >> .sp_choice_type_11"


async def accecpt_gdpr(page: Page) -> None:
    if await page.locator(GDPR_BUTTON_SELECTOR).count():
        await page.locator(GDPR_BUTTON_SELECTOR).dispatch_event("click")


class GuardianSpider(scrapy.Spider):
    name = "guardian"
    allowed_domains = ["www.theguardian.com"]
    start_urls = ["https://www.theguardian.com"]

    def start_requests(self) -> Iterable[scrapy.Request]:
        url = "https://www.theguardian.com"
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
            ),
        )

    async def parse(self, response):
        self.page = response.meta["playwright_page"]
        await accecpt_gdpr(self.page)
        # start scraping the page here
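One pitfall to note: in the async API, Locator.count() returns a coroutine, so the presence check must be awaited (if await page.locator(...).count():). A bare coroutine object is always truthy, so without the await the click branch would run even when the count is zero. A minimal asyncio sketch (not Playwright itself) of why:

```python
import asyncio


async def count() -> int:
    # Stand-in for the async Locator.count() coroutine.
    return 0

coro = count()
truthy_without_await = bool(coro)  # coroutine objects are always truthy
actual_count = asyncio.run(coro)   # awaiting yields the real value
print(truthy_without_await, actual_count)  # True 0
```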
Answered By: muw