Web scraping a website in Python using its application/ld+json data

Question:

I am trying to get the price of one item from the website at the url below. However, I am running into some issues when looking at the page source.

The url is: https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love

The part of the source page I am interested in is the following (I guess):

<script type="application/ld+json">
    [{

"@context":"http://schema.org",
"@type":"Product",
"productID":"25372685655708131",
"name":"LOVE bracelet, small model",
"description":"#LOVE# bracelet, small model, yellow gold 750/1000. Supplied with a screwdriver. Width: 3.65 mm (for size 17). Now available in a slimmer version, Cartier continues to write the story of the #LOVE# bracelet. Same design, same oval shape, same story: a timeless – yet slightly slimmer – creation which is fastened using a screwdriver. The closure is designed with a functional screw on one side of the bracelet and a hinge on the other. To determine the size of your #LOVE# bracelet, measure your wrist, adding one centimetre to your size for a tighter fit, or two centimetres for a looser fit.",
"image":["https://www.cartier.com/variants/images/25372685655708131/img1/w960.jpg"],
"offers": 
[{"@type":"Offer","availability":"http://schema.org/InStock","priceCurrency":"GBP","price":"4100","sku":"0400574782829","url":"https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html"}]}]
    </script>

I have tried the following steps:

import json
from bs4 import BeautifulSoup
import requests
from multiprocessing import Pool
import pandas as pd

data = {'url':[],'offers_price':[]}

def get_price(url):
    soup = BeautifulSoup(requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).content, "html.parser")
    data = json.loads(soup.find_all('script', {'type': 'application/ld+json'})[-1].get_text())
    return url, int(data['offers']['price'])

if __name__ == '__main__':

    urls = ['https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love']

    with Pool(processes=4) as pool:
        for url, price in pool.imap_unordered(get_price, urls):
            data['offers_price'].append(price)
            data['url'].append(url)
    print(data)

But it was not successful. How would you approach this?

Asked By: Seedizens


Answers:

I was able to get the price, but I got it from the product-price tag:

import json
from bs4 import BeautifulSoup
import requests
from multiprocessing import Pool
import pandas as pd

data = {'url':[],'offers_price':[]}

def get_price(url):
    soup = BeautifulSoup(requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).content, "html.parser")
    data = json.loads(soup.find_all('product-price')[-1]['data-model'])
    return url, int(data['fullPrice'])

if __name__ == '__main__':

    urls = ['https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love']

    with Pool(processes=4) as pool:
        for url, price in pool.imap_unordered(get_price, urls):
            data['offers_price'].append(price)
            data['url'].append(url)
    print(data)

Output:

{'url': ['https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love'], 'offers_price': [4100]}
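As a side note, the original JSON-LD approach in the question fails only because `offers` is a list of offer objects, not a single object; indexing it first makes the snippet from the question parse cleanly (a minimal sketch against the JSON shown in the question, not the live site):

```python
import json

# JSON-LD payload as shown in the question, abridged to the relevant fields
ld_json = '''
[{"@context": "http://schema.org",
  "@type": "Product",
  "name": "LOVE bracelet, small model",
  "offers": [{"@type": "Offer",
              "priceCurrency": "GBP",
              "price": "4100"}]}]
'''

data = json.loads(ld_json)[0]            # the script tag wraps the product in a list
price = int(data['offers'][0]['price'])  # 'offers' is also a list -> index it first
print(price)  # 4100
```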

By the way, are you sure you want to append the url and the price? I think you should do this instead:

data['offers_price'] = price
data['url'] = url
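If you do end up scraping several URLs, the appended lists map straight onto a DataFrame, which is presumably why pandas was imported in the first place (a sketch, assuming the `data` dict is filled as in the loop above):

```python
import pandas as pd

# Collected results, in the same shape as the question's `data` dict
data = {
    'url': ['https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love'],
    'offers_price': [4100],
}

# Parallel lists of equal length become one row per scraped URL
df = pd.DataFrame(data)
print(df)
```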
Answered By: Joan Lara

You can also do it with regular expressions, extracting the necessary information from the inline JSON.

In order to extract data from the inline JSON you need to:

  1. open the page source with CTRL + U;
  2. find the data (price, title etc.) with CTRL + F;
  3. use a regular expression to extract the relevant part of the inline JSON:
# https://regex101.com/r/EPJoTk/1
portion_of_script = re.findall(r"\[{\"@context\":(.*)", str(all_script))

Then we extract the currency and the price:

# https://regex101.com/r/az0sSf/1
currency = re.search(r"\"priceCurrency\":\"(.*?)\"", str(portion_of_script)).group(1)

# https://regex101.com/r/ngCxwm/1
price = re.search(r"\"price\":\"(.*?)\"", str(portion_of_script)).group(1)

Also, in case it is useful, I have an answer to a question about scraping cartier.com with pagination.

Check code in the online IDE.

from bs4 import BeautifulSoup
import requests, re, lxml

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
}
   
page = requests.get("https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love", headers=headers, timeout=30)
soup = BeautifulSoup(page.text, "lxml")
all_script = soup.select("script")

# https://regex101.com/r/EPJoTk/1
portion_of_script = re.findall(r"\[{\"@context\":(.*)", str(all_script))

# https://regex101.com/r/az0sSf/1
currency = re.search(r"\"priceCurrency\":\"(.*?)\"", str(portion_of_script)).group(1)

# https://regex101.com/r/ngCxwm/1
price = re.search(r"\"price\":\"(.*?)\"", str(portion_of_script)).group(1)

url = re.search(r"\"url\":\"(.*?)\"", str(portion_of_script)).group(1)

print(currency, price, url, sep="\n")

Output:

GBP
4250
https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html
Answered By: Denis Skopa