Multithreading / Multiprocessing in Selenium

Question:

I wrote a Python script that reads URLs from a text file and prints the href from an element on each page. My goal is to make it faster so it can run at a larger scale, using multiprocessing or multithreading.

In the workflow, each browser process would get the href from the current URL and then load the next link from the queue in the same browser instance (let's say there are 5 instances). Of course, each link should only get scraped once.

Example input File: HNlinks.txt

https://news.ycombinator.com/user?id=ingve
https://news.ycombinator.com/user?id=dehrmann
https://news.ycombinator.com/user?id=thanhhaimai
https://news.ycombinator.com/user?id=rbanffy
https://news.ycombinator.com/user?id=raidicy
https://news.ycombinator.com/user?id=svenfaw
https://news.ycombinator.com/user?id=ricardomcgowan

Code:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
with open("HNlinks.txt") as input1:
    # strip trailing newlines that readlines() would leave on each URL
    urls1 = [line.strip() for line in input1]

for url in urls1:
    driver.get(url)

    links = driver.find_elements(By.CLASS_NAME, "athing")
    for link in links:
        print(link.find_element(By.CSS_SELECTOR, "a").get_attribute("href"))
Asked By: request


Answers:

Using multiprocessing

Note: I have not test-run this answer locally. Please try it and give feedback:

from multiprocessing import Pool
from selenium import webdriver
from selenium.webdriver.common.by import By

with open("HNlinks.txt") as input1:
    urls1 = [line.strip() for line in input1]

def load_url(url):
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        links = driver.find_elements(By.CLASS_NAME, "athing")
        for link in links:
            print(link.find_element(By.CSS_SELECTOR, "a").get_attribute("href"))
    finally:
        driver.quit()  # close the browser so each process doesn't leak a Chrome instance

if __name__ == "__main__":
    # How many concurrent processes do you want to spawn? In practice this is
    # limited by the number of cores (and RAM) your machine has, so cap it
    # instead of launching one browser per URL.
    processes = min(len(urls1), 5)
    p = Pool(processes)
    p.map(load_url, urls1)
    p.close()
    p.join()
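Note that `Pool.map` above still creates a fresh browser for every URL. Since the question asks for a fixed number of browser instances that each keep pulling the next link from a queue, here is a sketch of that pattern using the standard library's `queue.Queue` and worker threads. The `scrape` function is a stand-in (an assumption, not the asker's code) for the real Selenium calls; in the real script each worker would create one `webdriver.Chrome()` when it starts, reuse it for every URL it pulls, and `quit()` it at the end.

```python
from queue import Queue
from threading import Thread

NUM_WORKERS = 5  # number of persistent worker/browser instances

def scrape(url):
    # Stand-in for the real Selenium work, e.g.:
    #   driver.get(url)
    #   for link in driver.find_elements(By.CLASS_NAME, "athing"):
    #       print(link.find_element(By.CSS_SELECTOR, "a").get_attribute("href"))
    return url + " scraped"

def worker(q, results):
    # Real version: driver = webdriver.Chrome() here, reused for every URL
    # this worker pulls, then driver.quit() before returning.
    while True:
        url = q.get()
        if url is None:        # sentinel: no more work for this worker
            q.task_done()
            break
        results.append(scrape(url))
        q.task_done()

def run(urls):
    q = Queue()
    results = []
    threads = [Thread(target=worker, args=(q, results)) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for url in urls:
        q.put(url)             # each URL is queued exactly once
    for _ in threads:
        q.put(None)            # one sentinel per worker so all of them exit
    q.join()
    for t in threads:
        t.join()
    return results
```

Because every URL is put on the queue exactly once and each worker removes items one at a time, no link is scraped twice, which matches the requirement in the question.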
Answered By: Edison