How to handle large-scale web scraping?

Question:

The Situation:

I recently started web scraping using Selenium and Scrapy, and I am working on a project where I have a CSV file containing 42,000 zip codes. My job is to take each zip code, enter it on this site, and scrape all the results.

The Problem:

The problem is that I have to keep clicking the 'load more' button until all the results have been displayed, and only once that has finished can I collect the data.

This might not be much of an issue on its own, but it takes about 2 minutes per zip code and I have 42,000 zip codes to get through, which adds up to roughly 58 days of non-stop scraping.

The Code:

    import scrapy
    from numpy.lib.npyio import load
    from selenium import webdriver
    from selenium.common.exceptions import ElementClickInterceptedException, ElementNotInteractableException, ElementNotSelectableException, NoSuchElementException, StaleElementReferenceException
    from selenium.webdriver.common.keys import Keys
    from items import CareCreditItem
    from datetime import datetime
    import os
    
    
    from scrapy.crawler import CrawlerProcess
    global pin_code
    pin_code = input("enter pin code")
    
    class CareCredit1Spider(scrapy.Spider):
        
        name = 'care_credit_1'
        start_urls = ['https://www.carecredit.com/doctor-locator/results/Any-Profession/Any-Specialty//?Sort=D&Radius=75&Page=1']
    
        def start_requests(self):
            
            directory = os.getcwd()
            options = webdriver.ChromeOptions()
            options.headless = True
    
            options.add_experimental_option("excludeSwitches", ["enable-logging"])
            path = (directory+r"\Chromedriver.exe")
            driver = webdriver.Chrome(path,options=options)
    
            #URL of the website
            url = "https://www.carecredit.com/doctor-locator/results/Any-Profession/Any-Specialty/" +pin_code + "/?Sort=D&Radius=75&Page=1"
            driver.maximize_window()
            #opening link in the browser
            driver.get(url)
            driver.implicitly_wait(200)
            
            try:
                cookies = driver.find_element_by_xpath('//*[@id="onetrust-accept-btn-handler"]')
                cookies.click()
            except:
                pass
    
            i = 0
            loadMoreButtonExists = True
            while loadMoreButtonExists:
                try:
                    load_more =  driver.find_element_by_xpath('//*[@id="next-page"]')
                    load_more.click()    
                    driver.implicitly_wait(30)
                except ElementNotInteractableException:
                    loadMoreButtonExists = False
                except ElementClickInterceptedException:
                    pass
                except StaleElementReferenceException:
                    pass
                except NoSuchElementException:
                    loadMoreButtonExists = False
    
            try:
                previous_page = driver.find_element_by_xpath('//*[@id="previous-page"]')
                previous_page.click()
            except:
                pass
    
            name = driver.find_elements_by_class_name('dl-result-item')
            r = 1
            temp_list=[]
            j = 0
            for element in name:
                link = element.find_element_by_tag_name('a')
                c = link.get_property('href')
                yield scrapy.Request(c)
    
        def parse(self, response):
            item = CareCreditItem()
            item['Practise_name'] = response.css('h1 ::text').get()
            item['address'] = response.css('.google-maps-external ::text').get()
            item['phone_no'] = response.css('.dl-detail-phone ::text').get()
            yield item
    now = datetime.now()
    dt_string = now.strftime("%d/%m/%Y")
    dt = now.strftime("%H-%M-%S")
    file_name = dt_string+"_"+dt+"zip-code"+pin_code+".csv"
    process = CrawlerProcess(settings={
        'FEED_URI' : file_name,
        'FEED_FORMAT':'csv'
    })
    process.crawl(CareCredit1Spider)
    process.start()
    print("CSV File is Ready")

items.py


    import scrapy

    class CareCreditItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        Practise_name = scrapy.Field()
        address = scrapy.Field()
        phone_no = scrapy.Field()

The Question:

Essentially, my question is simple: is there a way to optimize this code so that it performs faster, or are there other methods for scraping this data that don't take forever?

Asked By: Samyak jain


Answers:

Since the site loads the data dynamically from an API, you can retrieve the data directly from that API. This will speed things up quite a bit, but I'd still implement a wait between requests to avoid hitting a rate limit.

    import requests
    import time
    import pandas as pd

    zipcode = '00704'
    radius = 75
    url = f'https://www.carecredit.com/sites/ContentServer?d=&pagename=CCGetLocatorService&Zip={zipcode}&City=&State=&Lat=&Long=&Sort=D&Radius={radius}&PracticePhone=&Profession=&location={zipcode}&Page=1'
    req = requests.get(url)
    r = req.json()
    data = r['results']

    # Fetch the remaining pages reported by the API
    for i in range(2, r['maxPage']+1):
        url = f'https://www.carecredit.com/sites/ContentServer?d=&pagename=CCGetLocatorService&Zip={zipcode}&City=&State=&Lat=&Long=&Sort=D&Radius={radius}&PracticePhone=&Profession=&location={zipcode}&Page={i}'
        req = requests.get(url)
        r = req.json()
        data.extend(r['results'])
        time.sleep(1)  # small delay to stay under the rate limit

    df = pd.DataFrame(data)
    # Dashes instead of slashes in the timestamp keep the file name valid on every OS
    df.to_csv(f'{pd.Timestamp.now().strftime("%d-%m-%Y_%H-%M-%S")}zip-code{zipcode}.csv')
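
Since the original task covers 42,000 zip codes from a CSV file, the same approach can be wrapped in a loop. Below is a minimal sketch under the assumption that the CSV has a single column named `zipcode`; the file names used here (`zipcodes.csv`, `all_zipcodes.csv`) are placeholders to adapt:

    import requests
    import time
    import pandas as pd

    def fetch_zip(zipcode, radius=75, delay=1):
        """Fetch every result page for one zip code from the locator API."""
        base = ('https://www.carecredit.com/sites/ContentServer?d=&pagename=CCGetLocatorService'
                f'&Zip={zipcode}&City=&State=&Lat=&Long=&Sort=D&Radius={radius}'
                f'&PracticePhone=&Profession=&location={zipcode}&Page=')
        first = requests.get(base + '1').json()
        data = first.get('results', [])
        for page in range(2, first.get('maxPage', 1) + 1):
            data.extend(requests.get(base + str(page)).json().get('results', []))
            time.sleep(delay)  # be polite between page requests
        return data

    # 'zipcodes.csv' and its 'zipcode' column are assumptions about your input file
    zip_codes = pd.read_csv('zipcodes.csv', dtype=str)['zipcode']
    all_results = []
    for z in zip_codes:
        all_results.extend(fetch_zip(z))

    pd.DataFrame(all_results).to_csv('all_zipcodes.csv', index=False)
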
Answered By: RJ Adriaansen

There are multiple ways in which you can do this.

1. Create a distributed system in which you run the spider across multiple machines so the work runs in parallel.

In my opinion this is the best option, as you can also build a scalable, dynamic solution that you will be able to reuse many times over.

There are many ways of doing this. Normally it consists of dividing the seed list (the zip codes) into many separate seed lists so that separate processes each work on their own seed list; the downloads then run in parallel, so on 2 machines it goes roughly 2 times faster, on 10 machines roughly 10 times faster, and so on.

To do this I would suggest looking into AWS, namely AWS Lambda, AWS EC2 instances, or even AWS Spot instances; these are the ones I have worked with previously and they are not terribly hard to work with.

2. Alternatively, if you want to run it on a single machine, you can look into multithreading with Python, which lets you run the process in parallel on that one machine (see the sketch after this list).

3. Another option, particularly if this is a one-off process, is to run it simply with requests, which may speed things up; but with a massive number of seeds it is usually faster to build a process that runs in parallel.
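
For option 2, here is a minimal sketch using a thread pool from the Python standard library to scrape several zip codes concurrently on one machine. The `scrape_zip` function, the `zipcodes.csv` file and the worker count are placeholders to adapt to your own scraper:

    import csv
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def scrape_zip(zip_code):
        # Placeholder: call your per-zip-code scraping logic here
        # (for example the API-based fetch from the answer above)
        # and return a list of result dicts.
        return []

    # Hypothetical seed list: one zip code per line, no header
    with open('zipcodes.csv', newline='') as f:
        zip_codes = [row[0] for row in csv.reader(f)]

    results = []
    # 8 workers is an arbitrary starting point; tune it to what the site tolerates
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(scrape_zip, z) for z in zip_codes]
        for future in as_completed(futures):
            results.extend(future.result())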

Answered By: Kwsswart

While RJ Adriaansen's approach is valid and the most effective in this particular case, I would like to stress that Scrapy is the go-to framework for such tasks because of its speed and its ability to be hosted in the cloud. So I will post a solution made with Scrapy and provide some additional options for using a headless browser as an API, to increase speed and scale for JS-heavy scraping jobs.

This is how the code will look for your specific example using Scrapy and API calls for this site.

    # -*- coding: utf-8 -*-
    import json

    import scrapy
    from scrapy import Request


    class CodeSpider(scrapy.Spider):
        name = "code"

        def start_requests(self):
            zip_codes = ['00704', '00705', '00706']
            radius = 75

            for zip_code in zip_codes:
                url = 'https://www.carecredit.com/sites/ContentServer?d=&pagename=CCGetLocatorService&Zip={}&City=&State=&Lat=&Long=&Sort=D&Radius={}&PracticePhone=&Profession=&location={}&Page=1'.format(zip_code, radius, zip_code)
                yield Request(url)

        def parse(self, response):
            d = json.loads(response.text)

            # Check if the response returns any results
            if 'results' in d:
                for i in d['results']:
                    yield {
                        'Practise_name': i['name'],
                        'address': i['address1'],
                        'phone_no': i['phone'],
                    }

            # Check if the response returns a pagination key; request the remaining pages
            if 'maxPage' in d:
                for page in range(2, d['maxPage'] + 1):
                    url = response.url.replace('Page=1', 'Page={}'.format(page))
                    yield Request(url, callback=self.parse)

The code will return this as a result:

    {
      'Practise_name': 'Walgreens 00973',
      'address': 'Eljibaro Ave And Pr 172',
      'phone_no': '(787) 739-4386'
    }

Obviously, you can save it as JSON or CSV; it's up to your needs. According to the Scrapy documentation:

> ## [Serialization formats](https://docs.scrapy.org/en/latest/topics/feed-exports.html#serialization-formats)
> 
> For serializing the scraped data, the feed exports use the  [Item
> exporters](https://docs.scrapy.org/en/latest/topics/exporters.html#topics-exporters).
> These formats are supported out of the box:
> 
> -   [JSON](https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-format-json)
>     
> -   [JSON lines](https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-format-jsonlines)
>     
> -   [CSV](https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-format-csv)
>     
> -   [XML](https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-format-xml)
>     
> 
> But you can also extend the supported format through the 
> [`FEED_EXPORTERS`](https://docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEED_EXPORTERS)
> setting.
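
For example, running the spider above with a CSV feed can look roughly like the sketch below. It assumes Scrapy 2.1+, where the `FEEDS` setting replaced the older `FEED_URI`/`FEED_FORMAT` pair used in the question, and the output file name is only an illustration:

    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings={
        'FEEDS': {
            'care_credit_results.csv': {  # hypothetical output file name
                'format': 'csv',
            },
        },
    })
    process.crawl(CodeSpider)
    process.start()
    print("CSV file is ready")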

Here are some API services and tools that provide headless browsers, which can help your Scrapy spider run much faster and at larger scale on JavaScript-heavy websites:

  1. Web Unlocker – waterfall proxy management, JS rendering, CAPTCHA solver, browser fingerprint optimization. Premium solution.
  2. Scrapingbee API – proxy, JS rendering, provides a free trial and nice docs.
  3. Scrapy Splash – Scrapy's native headless solution. Quite hard to use and expensive.
  4. PhantomJsCloud – proxy, JS rendering, screenshots, etc. Provides a free trial. Docs could be more user-friendly.
  5. ScraperApi – proxy, JS rendering, provides a free trial and nice docs.
  6. scrapy-selenium middleware – free native middleware that adds a headless-browser step to Scrapy (a minimal usage sketch follows this list).
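
For option 6, here is a minimal sketch of how the scrapy-selenium middleware is typically wired up, assuming the package is installed and ChromeDriver is available; the spider, its name and the example URL are placeholders:

    # settings.py (relevant parts only)
    SELENIUM_DRIVER_NAME = 'chrome'
    SELENIUM_DRIVER_EXECUTABLE_PATH = 'chromedriver'  # adjust to your driver location
    SELENIUM_DRIVER_ARGUMENTS = ['--headless']
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_selenium.SeleniumMiddleware': 800,
    }

    # spider module
    import scrapy
    from scrapy_selenium import SeleniumRequest

    class JsSpider(scrapy.Spider):
        name = 'js_spider'

        def start_requests(self):
            # Example URL only; replace with the pages you need rendered
            yield SeleniumRequest(url='https://www.carecredit.com/doctor-locator/',
                                  callback=self.parse)

        def parse(self, response):
            # The response body is the HTML after JavaScript has executed
            yield {'title': response.css('title::text').get()}
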
Answered By: Gidoneli