Python requests result doesn't match the website because of JavaScript

Question:

I’m trying to scrape product links from a webpage (URL below). The page uses JavaScript. I tried different libraries, but the links don’t show up in the results (the links have the format */product/*, as you can see by hovering over product links when you open the URL below).

url = 'https://www.bcliquorstores.com/product-catalogue?categoryclass=coolers%20%26%20ciders&special=new%20product&sort=name.raw:asc&page=1'

headers = {
    'Host': 'www.bcliquorstores.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/111.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.7,fa;q=0.3',
}

Using the requests library:

import requests
res = requests.get(url, headers=headers)

Using the urllib library:

import urllib.request
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request)
response.read().decode()

Using the requests_html library:

from requests_html import HTMLSession, AsyncHTMLSession
asession = AsyncHTMLSession()
r = await asession.get(url, headers=headers)
await r.html.arender()
res = r.html.html

When I search for the string /product/ in the results, it isn’t there, even though it’s visible in the browser’s Inspect window.
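For reference, the check is just a substring search on the response body, roughly like this (a minimal sketch against the requests result above):

import requests

res = requests.get(url, headers=headers)
# The served HTML doesn't contain '/product/', so this prints False
print('/product/' in res.text)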

I know about Selenium, but I want to use it only if there is no other way.

Asked By: LoMaPh


Answers:

Websites like this usually don’t use any kind of scraping protection, because they want to present their products to as many customers as possible.

If a page is dynamic (like the one in your question), it can only receive its data via an HTTP request or a WebSocket connection, and that traffic is easy to inspect.

Open the web inspector (F12) in your browser and reload the URL in question. In the "Network" tab you can see that, after the page loads, the browser makes a request to another API endpoint (…/ajax/…).

So you can write a very simple script that makes use of that endpoint:

import requests


def print_product_urls():
    # The ampersand inside "coolers & ciders" must be percent-encoded as %26,
    # otherwise it is parsed as a query-string separator.
    resp = requests.get(
        'https://www.bcliquorstores.com/ajax/browse?special=new+product&'
        'categoryclass=coolers+%26+ciders&sort=name.raw:asc&size=24&page=1',
        headers={
            'Accept': 'application/json, text/plain, */*',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/111.0.0.0 Safari/537.36',
        }
    )
    # Each hit's _id is the product id used in the /product/<id> URLs
    for product in resp.json()['hits']['hits']:
        print(f'https://www.bcliquorstores.com/product/{product["_id"]}')


if __name__ == '__main__':
    print_product_urls()

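If you need more than the first 24 results, you can keep incrementing the page parameter; a hedged sketch, assuming the endpoint simply returns an empty hits list once you go past the last page:

import requests


def iter_product_urls():
    page = 1
    while True:
        resp = requests.get(
            'https://www.bcliquorstores.com/ajax/browse?special=new+product&'
            'categoryclass=coolers+%26+ciders&sort=name.raw:asc&size=24'
            f'&page={page}',
            headers={'Accept': 'application/json, text/plain, */*'},
        )
        hits = resp.json()['hits']['hits']
        if not hits:
            # Assumption: an empty page means we are past the last one
            break
        for product in hits:
            yield f'https://www.bcliquorstores.com/product/{product["_id"]}'
        page += 1


for product_url in iter_product_urls():
    print(product_url)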
Otherwise, there is no other way than running the JavaScript, using Selenium or another browser-automation library.
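If you do end up rendering the JavaScript, a minimal Selenium sketch could look like this (the a[href*="/product/"] selector is an assumption about the rendered markup, and you may need explicit waits instead of the implicit one):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # requires ChromeDriver / Selenium Manager
try:
    driver.implicitly_wait(10)  # give the JavaScript time to render the catalogue
    driver.get(
        'https://www.bcliquorstores.com/product-catalogue'
        '?categoryclass=coolers%20%26%20ciders&special=new%20product'
        '&sort=name.raw:asc&page=1'
    )
    # Collect every anchor whose href contains /product/
    for link in driver.find_elements(By.CSS_SELECTOR, 'a[href*="/product/"]'):
        print(link.get_attribute('href'))
finally:
    driver.quit()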


Answered By: Ivan Vinogradov