Python scraper won't complete

Question:

I am using this code to scrape emails from Google search results. However, it only scrapes the first 10 results, despite having 100 search results loaded.

Ideally, I would like for it to scrape all search results.

Is there a reason for this?

from selenium import webdriver
import time
import re
import pandas as pd

PATH = r'C:\Program Files (x86)\chromedriver.exe'

l = list()
o = {}

target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40&rlz=1C1CHBD_en-GBIT1013IT1013&sxsrf=AJOqlzWC1oRbVtWcmcIgC4-3ZnGkQ8sP_A%3A1675764565222&ei=VSPiY6WeDYyXrwStyaTwAQ&ved=0ahUKEwjlnIy9lYP9AhWMy4sKHa0kCR4Q4dUDCA8&uact=5&oq=solicitors+wales+%27email%27+%40&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIFCAAQogQyBwgAEB4QogQyBQgAEKIESgQIQRgASgQIRhgAUABYAGD4AmgAcAF4AIABc4gBc5IBAzAuMZgBAKABAcABAQ&sclient=gws-wiz-serp"

driver = webdriver.Chrome(PATH)

driver.get(target_url)

email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,4}"
html = driver.page_source
emails = re.findall(email_pattern, html)

time.sleep(10)
df = pd.DataFrame(emails, columns=['Email Addresses'])
df.to_excel('email_addresses_.xlsx', index=False)
#print(emails)
driver.close()
Asked By: someone


Answers:

The code is working as expected: it scrapes 10 results, which is the default page size for Google Search. You can use a locator method such as 'find_element(By.XPATH, ...)' to find the next button and click it.

This operation needs to be repeated in a loop until sufficient results have been collected. Refer to Selenium locating elements for more details.

For details on how to use these Selenium commands, look them up in the Selenium documentation; similar questions on the web can also provide useful references.
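For example, here is a minimal sketch of locating and clicking the next button with an explicit wait. It assumes the driver from the question, and uses the id "pnnext", which is what Google currently gives its "Next" link (as the answer below does):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the "Next" link (id="pnnext") to become
# clickable, then click through to the next page of results.
next_link = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, "pnnext"))
)
next_link.click()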

Answered By: Bijendra

Following up on Bijendra’s answer, you could update the code as below:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import re
import pandas as pd


PATH = r'C:\Program Files (x86)\chromedriver.exe'

target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40&rlz=1C1CHBD_en-GBIT1013IT1013&sxsrf=AJOqlzWC1oRbVtWcmcIgC4-3ZnGkQ8sP_A%3A1675764565222&ei=VSPiY6WeDYyXrwStyaTwAQ&ved=0ahUKEwjlnIy9lYP9AhWMy4sKHa0kCR4Q4dUDCA8&uact=5&oq=solicitors+wales+%27email%27+%40&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIFCAAQogQyBwgAEB4QogQyBQgAEKIESgQIQRgASgQIRhgAUABYAGD4AmgAcAF4AIABc4gBc5IBAzAuMZgBAKABAcABAQ&sclient=gws-wiz-serp"

driver = webdriver.Chrome(service=Service(PATH))  # Selenium 4 takes the driver path via a Service object

driver.get(target_url)
emails = []
email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}"  # escape the dot and drop the stray '|' from the TLD class
for i in range(2):
    html = driver.page_source
    for e in re.findall(email_pattern, html):
        emails.append(e)
    a_attr = driver.find_element(By.ID, "pnnext")  # Google's "Next" link
    a_attr.click()
    time.sleep(2)  # wait for the next results page to load before scraping it

df = pd.DataFrame(emails, columns=['Email Addresses'])
df.to_csv('email_addresses_.csv', index=False)
driver.close()

You could either change the range value passed to the for loop or replace the for loop entirely with a while loop. So instead of

for i in range(2):

You could do:

while len(emails) < 100:

Make sure to manage the timing of page navigation: wait for the next page to load before extracting the available emails, and only then click the next button on the search results page, as in the sketch below.
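Here is a minimal sketch of the while-loop variant. It assumes the driver, email_pattern and emails list from the code above, and the 2-second sleep is an illustrative value:

from selenium.common.exceptions import NoSuchElementException

while len(emails) < 100:
    html = driver.page_source
    emails.extend(re.findall(email_pattern, html))
    try:
        driver.find_element(By.ID, "pnnext").click()
    except NoSuchElementException:
        break  # no "Next" link, so this was the last results page
    time.sleep(2)  # wait for the next page to load before scraping it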

Make sure to refer to the Selenium documentation to get a clear idea of how to achieve what you want.

Answered By: Vee Dee

Selenium launches its own clean browser profile, so your Google setting for 100 results per page is not applied in the scraping session; the default of 10 results is what you're getting. You will have better luck using query parameters, adding the one for the number of results to the end of your URL.

If you need further information on query parameters to achieve this, it’s the second method described in How to Show 100 Results Per Page in Google Search.
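For example, a minimal sketch (num is the query parameter Google has historically used for results per page; it is undocumented and may change):

# Request 100 results per page by appending num=100 to the search URL.
target_url = ("https://www.google.com/search"
              "?q=solicitors+wales+%27email%27+%40"
              "&num=100")
driver.get(target_url)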

Answered By: dot