How to speed up a BeautifulSoup web scraping project

Question:

I am working on a web scraping project in which I want to fetch prices from a website using different URLs. I ran the following code, but it takes a long time to print the price numbers. I am using PyCharm on a MacBook Pro 13” i5 (2020), 1.4 GHz, 8 GB RAM, if that helps.

import ssl
import bs4
from urllib.request import Request, urlopen
import json

# Disable SSL certificate verification
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Define the URLs to monitor
urls = ['https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-hardwear-graduated-link-necklace-63008966/', 'https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-t-smile-pendant-35189459/']

for i in urls:

    # Open each URL, sending a browser-like User-Agent header so the website does not block the request
    req = Request(
        url=i,
        headers={'User-Agent': 'Mozilla/5.0'}
    )

    # Download the page and parse its HTML
    webpage = urlopen(req, context=ctx).read()
    soup = bs4.BeautifulSoup(webpage, "html.parser")

    # Extract the price from the page's JSON-LD structured data (the last script tag of type application/ld+json)
    data = json.loads(soup.find_all('script', {'type': 'application/ld+json'})[-1].get_text())
    price = int(data['offers']['price'])

    print(price)

With a single URL the code works, but after adding more URLs and a simple for loop it takes a while. How could I speed up the process? Thanks a lot!

Asked By: Seedizens


Answers:

You can speed up the processing with multi-threading or multi-processing, since the bottleneck is network I/O rather than computation. This example uses the multiprocessing module (with a Pool of 4 processes) to obtain the prices; a thread-based sketch follows after the sample output:

import json
from bs4 import BeautifulSoup
import requests
from multiprocessing import Pool


def get_price(url):
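    # Fetch the page and extract the price from the last JSON-LD script tag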
    soup = BeautifulSoup(requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).content, "html.parser")
    data = json.loads(soup.find_all('script', {'type': 'application/ld+json'})[-1].get_text())
    return url, int(data['offers']['price'])

if __name__ == '__main__':

    urls = ['https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-hardwear-graduated-link-necklace-63008966/', 'https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-t-smile-pendant-35189459/']

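    # imap_unordered yields each result as soon as its worker finishes, hence the arbitrary order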
    with Pool(processes=4) as pool:
        for url, price in pool.imap_unordered(get_price, urls):
            print(url, price)

Prints (for example; the order may vary):

https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-t-smile-pendant-35189459/ 920
https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-hardwear-graduated-link-necklace-63008966/ 13900
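
Since the slow part is waiting on network I/O rather than CPU work, plain threads serve just as well and avoid the overhead of spawning processes. Below is a minimal sketch of the same approach using concurrent.futures.ThreadPoolExecutor from the standard library, with the same get_price function as above:

import json
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup
import requests


def get_price(url):
    # Same logic as above: fetch the page and read the price from the last JSON-LD script tag
    soup = BeautifulSoup(requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).content, "html.parser")
    data = json.loads(soup.find_all('script', {'type': 'application/ld+json'})[-1].get_text())
    return url, int(data['offers']['price'])


if __name__ == '__main__':

    urls = ['https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-hardwear-graduated-link-necklace-63008966/', 'https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-t-smile-pendant-35189459/']

    # executor.map preserves the input order of the URLs, unlike imap_unordered above
    with ThreadPoolExecutor(max_workers=4) as executor:
        for url, price in executor.map(get_price, urls):
            print(url, price)

Note that requests verifies TLS certificates by default, so the ssl workaround from the question is not needed here; if you do have to skip verification, passing verify=False to requests.get has the same effect.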
Answered By: Andrej Kesely