Optimising a Python scraping script to avoid getting blocked or draining resources

Question:

I have a fairly basic Python script that scrapes a property website and stores the address and price in a CSV file. There are over 5000 listings to go through, but my current code times out after a while (at about 2000 listings), and the console shows 302 and CORS policy errors.

import requests
import itertools
from bs4 import BeautifulSoup
from csv import writer
from random import randint
from time import sleep
from datetime import date


url = "https://www.propertypal.com/property-for-sale/northern-ireland/page-"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}
filename = date.today().strftime("ni-listings-%Y-%m-%d.csv")

with open(filename, 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Address', 'Price']
    thewriter.writerow(header)

    # for page in range(1, 3):
    for page in itertools.count(1):
        req = requests.get(f"{url}{page}", headers=headers)
        soup = BeautifulSoup(req.content, 'html.parser')

        for li in soup.find_all('li', class_="pp-property-box"):
            title = li.find('h2').text
            price = li.find('p', class_="pp-property-price").text

            info = [title, price]
            thewriter.writerow(info)

        sleep(randint(1, 5))

# this script scrapes all pages and records all listings and their prices in daily csv

As you can see, I added sleep(randint(1, 5)) to add random intervals, but I possibly need to do more. Of course I want to scrape the site in its entirety as quickly as possible, but I also want to be respectful to the site being scraped and minimise the burden on it.

Can anyone suggest updates? P.S. Forgive rookie errors, I'm very new to Python/scraping!

Asked By: cts

||

Answers:

This is one way of getting that data. Bear in mind there are only 251 pages, with 12 properties on each (3,012 listings in total), not over 5,000:

import requests
import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup as bs

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    'accept': 'application/json',
    'accept-language': 'en-US,en;q=0.9',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin'
}
# Reuse one session so the connection and headers are shared across requests
s = requests.Session()
s.headers.update(headers)
big_list = []
# 251 result pages at the time of writing; range() excludes the upper bound
for x in tqdm(range(1, 252)):
    r = s.get(f'https://www.propertypal.com/property-for-sale/northern-ireland/page-{x}')
    soup = bs(r.text, 'html.parser')
    # Each listing is an <li class="pp-property-box">
    for p in soup.select('li.pp-property-box'):
        heading = p.select_one('h2')
        link = p.select_one('a')
        price_tag = p.select_one('p.pp-property-price')
        # Missing elements become None rather than raising AttributeError
        name = heading.get_text(strip=True) if heading else None
        url = 'https://www.propertypal.com' + link.get('href') if link else None
        price = price_tag.get_text(strip=True) if price_tag else None
        big_list.append((name, price, url))
big_df = pd.DataFrame(big_list, columns=['Property', 'Price', 'Url'])
print(big_df)

Result printed in terminal:

100% 251/251 [03:41<00:00, 1.38it/s]
      Property                                              Price                   Url
0     22 Erinvale Gardens, Belfast, BT10 0FS                Asking price£165,000    https://www.propertypal.com/22-erinvale-gardens-belfast/777820
1     Laurel Hill, 37 Station Road, Saintfield, BT24 7DZ    Guide price£725,000     https://www.propertypal.com/laurel-hill-37-station-road-saintfield/751274
2     19 Carrick Brae, Burren Warrenpoint, Newry, BT34 3TH  Guide price£265,000     https://www.propertypal.com/19-carrick-brae-burren-warrenpoint-newry/775302
3     7b Conway Street, Lisburn, BT27 4AD                   Offers around£299,950   https://www.propertypal.com/7b-conway-street-lisburn/779833
4     Hartley Hall, Greenisland                             From£280,000to£397,500  https://www.propertypal.com/hartley-hall-greenisland/d850
...   ...                                                   ...                     ...
3007  8 Shimna Close, Newtownards, BT23 4PE                 Offers around£99,950    https://www.propertypal.com/8-shimna-close-newtownards/756825
3008  7 Barronstown Road, Dromore, BT25 1NT                 Guide price£380,000     https://www.propertypal.com/7-barronstown-road-dromore/756539
3009  39 Tamlough Road, Randalstown, BT41 3DP               Offers around£425,000   https://www.propertypal.com/39-tamlough-road-randalstown/753299
3010  Glengeen House, 17 Carnalea Road, Fintona, BT78 2BY   Offers over£180,000     https://www.propertypal.com/glengeen-house-17-carnalea-road-fintona/750105
3011  Walnut Road, Larne, BT40 2WE                          Offers around£169,950   https://www.propertypal.com/walnut-road-larne/749733
3012 rows × 3 columns
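
On the question's concern about getting blocked: the loop above requests all 251 pages back to back, with a hard-coded page count. One possible refinement, sketched below, is to pause for a random interval between pages and to stop at the first page that is not a plain 200 response or that contains no listings. The names `crawl`, `make_fetcher`, `fetch` and `parse` are hypothetical, and whether the site answers out-of-range pages with a 302 or with an empty page is an assumption that has not been verified here.

```python
import itertools
import random
import time


def crawl(fetch, parse, min_delay=1.0, max_delay=5.0, max_pages=None, sleep=time.sleep):
    """Yield parsed rows page by page, pausing between page requests.

    fetch(page) must return (status_code, html_text);
    parse(html_text) must return a list of rows.
    """
    pages = itertools.count(1) if max_pages is None else range(1, max_pages + 1)
    for page in pages:
        status, html = fetch(page)
        if status != 200:
            # e.g. a 302 redirect: treat it as "stop" instead of retrying at once
            break
        rows = parse(html)
        if not rows:
            # assumed end-of-results signal: a page without any listings
            break
        yield from rows
        # random pause so requests are spread out rather than sent back to back
        sleep(random.uniform(min_delay, max_delay))


def make_fetcher(session, base_url, timeout=10):
    """Wrap a requests.Session (or any object with a compatible .get) as fetch(page)."""
    def fetch(page):
        # allow_redirects=False surfaces a 302 as a status code instead of following it
        resp = session.get(f"{base_url}{page}", timeout=timeout, allow_redirects=False)
        return resp.status_code, resp.text
    return fetch
```

With the session from the answer this could be wired up as `crawl(make_fetcher(s, base_url), parse_page)`, where `parse_page` is your own function that returns the `(name, price, url)` tuples for one page. The delay bounds are the 1 to 5 seconds from the question; longer pauses are gentler on the site at the cost of a slower run.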

See relevant documentation for Requests: https://requests.readthedocs.io/en/latest/

For Pandas: https://pandas.pydata.org/docs/

For BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/

And for TQDM: https://pypi.org/project/tqdm/
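
Since the question wants a daily CSV rather than a printed DataFrame, the rows can also be written to disk as they are scraped instead of being collected in memory first, so a failure partway through does not lose everything. This is a minimal sketch using only the standard library; `write_listings` is a hypothetical helper, and the default filename pattern is the one from the question. With pandas, `big_df.to_csv(filename, index=False)` would write the finished DataFrame in one call.

```python
import csv
from datetime import date


def write_listings(rows, path=None):
    """Write (property, price, url) rows to a CSV file; return (path, row_count)."""
    if path is None:
        # same dated filename as in the question
        path = date.today().strftime("ni-listings-%Y-%m-%d.csv")
    count = 0
    with open(path, "w", encoding="utf8", newline="") as f:
        out = csv.writer(f)
        out.writerow(["Property", "Price", "Url"])
        # rows may be a list or a generator; each row is written as it arrives
        for row in rows:
            out.writerow(row)
            count += 1
    return path, count
```

Passing a generator as `rows` means each listing reaches the file as soon as its page has been parsed.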

Answered By: Barry the Platipus