I’ve written a script in Python, in combination with Selenium, to scrape the links of different posts from a landing page and then get the title of each post by following the URL to its inner page. Although the content I’m parsing here is static, I used Selenium to see how it behaves under multiprocessing.
However, my intention is to do the scraping using multiprocessing. So far I had understood that Selenium doesn’t support multiprocessing, but it seems I was wrong.
My question: how can I reduce the execution time when Selenium is made to run with multiprocessing?
This is my attempt (it works):
import requests
from urllib.parse import urljoin
from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup
from selenium import webdriver

def get_links(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    titles = [urljoin(link, item.get("href"))
              for item in soup.select(".summary .question-hyperlink")]
    return titles

def get_title(url):
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    driver = webdriver.Chrome(options=chromeOptions)
    driver.get(url)
    sauce = BeautifulSoup(driver.page_source, "lxml")
    item = sauce.select_one("h1 a").text
    print(item)

if __name__ == '__main__':
    url = "https://stackoverflow.com/questions/tagged/web-scraping"
    ThreadPool(5).map(get_title, get_links(url))
For scraping tasks without much interaction, I have had good results using the open-source Scrapy package for large-scale scraping. It handles concurrent requests out of the box, it is easy to write new spiders and store the data in files or a database, and it is really fast.
Your script would look something like this when implemented as a fully parallel Scrapy spider (note: I did not test this; see the documentation on selectors).
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self, response):
        # yield the absolute URL of each question on the listing page
        for href in response.css('.summary .question-hyperlink::attr(href)').getall():
            yield {'link': response.urljoin(href)}
To run it, put this into blogspider.py and run:

$ scrapy runspider blogspider.py
See the Scrapy website for a complete tutorial.
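If you also want the title from each post’s inner page, as in the original script, a rough sketch (untested; parse_title is an illustrative name) could chain a follow-up request per link:

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self, response):
        # follow each question link and parse its inner page
        for href in response.css('.summary .question-hyperlink::attr(href)').getall():
            yield response.follow(href, callback=self.parse_title)

    def parse_title(self, response):
        # same element the original script targets with BeautifulSoup
        yield {'title': response.css('h1 a::text').get()}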
“How can I reduce the execution time when Selenium is made to run with multiprocessing?”
A lot of time in your solution is spent on launching the webdriver for each URL. You can reduce this time by launching the driver only once per thread:
(... skipped for brevity ...)

threadLocal = threading.local()

def get_driver():
    driver = getattr(threadLocal, 'driver', None)
    if driver is None:
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument("--headless")
        driver = webdriver.Chrome(options=chromeOptions)
        setattr(threadLocal, 'driver', driver)
    return driver

def get_title(url):
    driver = get_driver()
    driver.get(url)
    (...)

(...)
On my system this reduces the time from 1m7s to just 24.895s, a ~63% reduction. To test it yourself, download the full script.
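In case the download is not available, here is a self-contained sketch assembling the question’s script with the per-thread driver above (illustrative, not the exact full script; note the drivers are never explicitly quit here, which the answer below addresses):

import threading
from urllib.parse import urljoin
from multiprocessing.pool import ThreadPool

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

threadLocal = threading.local()

def get_driver():
    # create at most one driver per worker thread and reuse it
    driver = getattr(threadLocal, 'driver', None)
    if driver is None:
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument("--headless")
        driver = webdriver.Chrome(options=chromeOptions)
        setattr(threadLocal, 'driver', driver)
    return driver

def get_links(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    return [urljoin(link, item.get("href"))
            for item in soup.select(".summary .question-hyperlink")]

def get_title(url):
    driver = get_driver()  # reuses this thread's driver
    driver.get(url)
    sauce = BeautifulSoup(driver.page_source, "lxml")
    print(sauce.select_one("h1 a").text)

if __name__ == '__main__':
    url = "https://stackoverflow.com/questions/tagged/web-scraping"
    ThreadPool(5).map(get_title, get_links(url))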
ThreadPool uses threads, which are constrained by the Python GIL. That’s OK as long as the task is mostly I/O bound. Depending on the post-processing you do with the scraped results, you may want to use a multiprocessing.Pool instead. This launches parallel processes which, as a group, are not constrained by the GIL. The rest of the code stays the same.
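As an illustration, a minimal sketch of that swap, assuming the same get_links and get_title as above (each process creates its own driver, since processes share nothing by default):

from multiprocessing import Pool

if __name__ == '__main__':
    url = "https://stackoverflow.com/questions/tagged/web-scraping"
    with Pool(5) as pool:
        # same map call as before, but across processes instead of threads
        pool.map(get_title, get_links(url))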
The one potential problem I see with the clever one-driver-per-thread answer is that it has no mechanism for quitting the drivers, which leaves the possibility of Chrome processes hanging around. I would make the following changes:
Create a class Driver that will create the driver instance and store it on thread-local storage, but that also has a destructor that will quit the driver when the thread-local storage is deleted:
class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit()  # clean up driver when we are cleaned up
        # print('The driver has been "quitted".')
create_driver then becomes:

threadLocal = threading.local()

def create_driver():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver
Finally, after you have no further use for the ThreadPool instance, but before it is terminated, add the following lines to delete the thread-local storage and force the Driver instances’ destructors to be called (hopefully):
del threadLocal
import gc
gc.collect()  # a little extra insurance
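Putting it together, a sketch of how the main block might look under these changes (assuming the question’s get_links, and a get_title that now calls create_driver):

if __name__ == '__main__':
    url = "https://stackoverflow.com/questions/tagged/web-scraping"
    with ThreadPool(5) as pool:
        pool.map(get_title, get_links(url))
        # delete the thread-local storage while the worker threads still
        # exist, i.e. before the pool is terminated on leaving the block
        del threadLocal
        import gc
        gc.collect()  # a little extra insurance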