Scrapy Crawl (referer: None)

Question:

I am new to Scrapy and Python. I am scraping data from Aliexpress.com with Playwright, and it returns (referer: None). Here is my code:

import scrapy
from scrapy_playwright.page import PageMethod

class AliSpider(scrapy.Spider):
    name = "aliex"

    def start_requests(self):
        # GET request
        search_value = 'phones'
        yield scrapy.Request(f"https://www.aliexpress.com/premium/{search_value}.html?spm=a2g0o.productlist.1000002.0&initiative_id=SB_20230118063054&dida=y",
         meta=dict(
            playwright= True,
            playwright_include_page = True,
            playwright_page_coroutines =[
                PageMethod('wait_for_selector', '.list--gallery--34TropR')
            ]
         ))
    

    async def parse(self, response):
        for data in response.xpath("//h1"):
            related_link = data.xpath(".//text()").get()
            yield{
                'related_link':related_link
            }

This is the output I am getting:

2023-01-18 19:56:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.aliexpress.com/wholesale?SearchText=phones&spm=a2g0o.productlist.1000002.0&initiative_id=SB_20230118063054&dida=y> (referer: None)
2023-01-18 19:56:55 [scrapy.core.engine] INFO: Closing spider (finished)

I tried both XPath and CSS selectors, but the result is the same. Can anyone help me, please?

Asked By: Sarfraz

Answers:

Here is a complete solution using standalone Playwright with Python, which works fine on Windows. The website loads its data dynamically via JavaScript, which is why I use the page.evaluate() method to execute JavaScript and scroll the entire page; otherwise, it will not scrape the complete result set.
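The scrolling step can also be isolated into a small helper. The following is only a sketch, not part of Playwright: the names `scroll_to_bottom`, `pause_s` and `max_steps` are made up for illustration, and `page` is assumed to be a Playwright sync `Page` (anything with an `evaluate` method works). It stops when the scroll offset no longer advances instead of comparing against a height measured up front. The largest possible `window.scrollY` is the document height minus the viewport height, so a comparison against the full height only terminates if the page grows while scrolling.

```python
import time

# JavaScript run in the browser: scroll down by one viewport and
# report the resulting vertical scroll offset.
SCROLL_JS = """() => {
    window.scrollBy(0, window.innerHeight);
    return window.scrollY;
}"""


def scroll_to_bottom(page, pause_s=2.0, max_steps=50):
    """Scroll one viewport at a time until the offset stops advancing.

    Returns the last scroll offset reached (-1 if no scroll happened).
    `max_steps` is a safety cap for pages that keep loading forever.
    """
    last_y = -1  # sentinel below any real scroll offset
    for _ in range(max_steps):
        y = page.evaluate(SCROLL_JS)
        if y <= last_y:
            break  # no movement: bottom reached and nothing new loaded
        last_y = y
        time.sleep(pause_s)  # give lazy-loaded content time to render
    return last_y
```

With a helper like this, the while loop in the script reduces to a single call, `scroll_to_bottom(page)`, placed after `page.wait_for_selector(...)`.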

Script:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import pandas as pd
import time

data = []
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    search_value = 'phones'
    for page_num in range(1,4):
       
        page.goto(f"https://www.aliexpress.com/wholesale?SearchText={search_value}&catId=0&dida=y&g=y&initiative_id=SB_20230118063054&page={page_num}&spm=a2g0o.productlist.1000002.0&trafficChannel=main")
        # Wait until the first product cards have rendered.
        page.wait_for_selector('[class="manhattan--content--1KpBbUi"]', timeout=30000)
        # Measure the document height before scrolling.
        scroll_height = page.evaluate("""() => {
                                return Math.max(
                                  document.body.scrollHeight, document.documentElement.scrollHeight,
                                  document.body.offsetHeight, document.documentElement.offsetHeight,
                                  document.body.clientHeight, document.documentElement.clientHeight
                                );
                            }""")
        # Scroll down one viewport at a time so the lazy-loaded cards render.
        current_height = 0
        while current_height < scroll_height:
            current_height = page.evaluate("""() => {
                                window.scrollBy(0, window.innerHeight);
                                return window.scrollY;
                            }""")
            time.sleep(2)  # give newly loaded cards time to appear
        # Parse the fully rendered HTML and collect each card's title.
        soup = BeautifulSoup(page.content(), 'lxml')
        for card in soup.select('[class="manhattan--content--1KpBbUi"]'):
            title = card.h1.text
            data.append({'title': title})

df = pd.DataFrame(data)
print(df)

Output:

                 title
0    Unlock Samsung Galaxy S10 S10+ s10e G970U G973...
1    SERVO K07 Plus mini Mobile Phone Pen Dual SIM ...
2    BLACKVIEW OSCAL C80 Smartphone 6.5" Waterdrop ...
3    Original Apple iPhone 7 Unlocked 99% New Mobil...
4    [World Premiere] Blackview BV9200 Rugged Smart...
..                                                 ...
175  Motorola StarTAC Rainbow 500mAh Fashion 90% Ne...
176  Original International Version HuaWei P30 Pro ...
177  Unlocked Original Apple iPhone SE Dual Core 2G...
178  2022 Unihertz TANK Large Battery Rugged Smartp...
179  75W Car Wireless Charger Car Mount Phone Holde...

[180 rows x 1 columns]
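
The search URL in the script is written out by hand. As an optional refinement, it could be built from its parts so that multi-word search terms are encoded correctly. This is a minimal sketch using only the standard library; `build_search_url` and `BASE_URL` are illustrative names, and the query parameters are copied unchanged from the URL in the script (whether the site needs all of them is not verified here).

```python
from urllib.parse import urlencode

BASE_URL = "https://www.aliexpress.com/wholesale"


def build_search_url(search_value, page_num):
    """Return the search URL for one result page of `search_value`."""
    # Same parameters, in the same order, as the hand-written URL.
    params = {
        "SearchText": search_value,
        "catId": 0,
        "dida": "y",
        "g": "y",
        "initiative_id": "SB_20230118063054",
        "page": page_num,
        "spm": "a2g0o.productlist.1000002.0",
        "trafficChannel": "main",
    }
    # urlencode escapes spaces and other special characters in the values.
    return f"{BASE_URL}?{urlencode(params)}"


# URLs for the first three result pages, as in the script's loop.
urls = [build_search_url("phones", n) for n in range(1, 4)]
```

Inside the loop, `page.goto(build_search_url(search_value, page_num))` would then replace the f-string.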
Answered By: Md. Fazlul Hoque