How to scrape all results from Google search results pages (Python/Selenium ChromeDriver)

Question:

I am working on a Python script that uses Selenium ChromeDriver to scrape all Google search results (link, header, text) from a specified number of results pages.

The code I have seems to scrape only the first result from every page after the first page.
I think this has something to do with how my for-loop is set up in the scrape function, but I have not been able to tweak it into working the way I’d like it to. Any suggestions on how to fix this, or a better approach, are appreciated.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# create instance of webdriver
driver = webdriver.Chrome()
url = 'https://www.google.com'
driver.get(url)

# set keyword
keyword = 'cars' 
# we find the search bar using its name attribute value
searchBar = driver.find_element_by_name('q')
# send our keyword to the search bar, followed by the enter key
searchBar.send_keys(keyword)
searchBar.send_keys('\n')

def scrape():
   pageInfo = []
   try:
      # wait for search results to be fetched
      WebDriverWait(driver, 10).until(
      EC.presence_of_element_located((By.CLASS_NAME, "g"))
      )
    
   except Exception as e:
      print(e)
      driver.quit()
   # contains the search results
   searchResults = driver.find_elements_by_class_name('g')
   for result in searchResults:
       element = result.find_element_by_css_selector('a')
       link = element.get_attribute('href')
       header = result.find_element_by_css_selector('h3').text
       text = result.find_element_by_class_name('IsZvec').text
       pageInfo.append({
           'header' : header, 'link' : link, 'text': text
       })
       return pageInfo

# Number of pages to scrape
numPages = 5
# All the scraped data
infoAll = []
# Scraped data from page 1
infoAll.extend(scrape())

for i in range(0 , numPages - 1):
   nextButton = driver.find_element_by_link_text('Next')
   nextButton.click()
   infoAll.extend(scrape())

print(infoAll)
Asked By: gdr1738


Answers:

You have an indentation problem:

You should have return pageInfo outside the for loop; otherwise you return the results after the first loop iteration:

for result in searchResults:
       element = result.find_element_by_css_selector('a')
       link = element.get_attribute('href')
       header = result.find_element_by_css_selector('h3').text
       text = result.find_element_by_class_name('IsZvec').text
       pageInfo.append({
           'header' : header, 'link' : link, 'text': text
       })
       return pageInfo

Like this:

for result in searchResults:
       element = result.find_element_by_css_selector('a')
       link = element.get_attribute('href')
       header = result.find_element_by_css_selector('h3').text
       text = result.find_element_by_class_name('IsZvec').text
       pageInfo.append({
           'header' : header, 'link' : link, 'text': text
       })
return pageInfo
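
Applied to your whole scrape function, the fix looks like this (same code as in your question, with the return dedented out of the loop):

def scrape():
    pageInfo = []
    try:
        # wait for search results to be fetched
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "g"))
        )
    except Exception as e:
        print(e)
        driver.quit()
    # contains the search results
    searchResults = driver.find_elements_by_class_name('g')
    for result in searchResults:
        element = result.find_element_by_css_selector('a')
        link = element.get_attribute('href')
        header = result.find_element_by_css_selector('h3').text
        text = result.find_element_by_class_name('IsZvec').text
        pageInfo.append({
            'header': header, 'link': link, 'text': text
        })
    return pageInfo  # runs after the loop has visited every result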

I’ve run your code and got results:

[{'header': 'Cars (film) — Wikipédia', 'link': 'https://fr.wikipedia.org/wiki/Cars_(film)', 'text': "Cars : Quatre Roues, ou Les Bagnoles au Québec (Cars), est le septième long-métrage d'animation entièrement en images de synthèse des studios Pixar.\nPays d'origine : États-Unis\nDurée : 116 minutes\nSociétés de production : Pixar Animation Studios\nGenre : Animation\nCars 2 · Michel Fortin · Flash McQueen"}, {'header': 'Cars – Wikipedia, la enciclopedia libre', 'link': 'https://es.wikipedia.org/wiki/Cars', 'text': 'Cars es una película de animación por computadora de 2006, producida por Pixar Animation Studios y lanzada por Walt Disney Studios Motion Pictures.\nAño : 2006\nGénero : Animación; Aventuras; Comedia; Infa…\nHistoria : John Lasseter Joe Ranft Jorgen Klubi…\nProductora : Walt Disney Pictures; Pixar Animat…'}, {'header': '', 'link': 'https://fr.wikipedia.org/wiki/Flash_McQueen', 'text': ''}, {'header': '', 'link': 'https://www.allocine.fr/film/fichefilm-55774/secrets-tournage/', 'text': ''}, {'header': '', 'link': 'https://fr.wikipedia.org/wiki/Martin_(Cars)', 'text': ''},

Suggestions:

Use a timer to control your for loop; otherwise you could be banned by Google for suspicious activity.

Steps:
1. Import sleep from time: from time import sleep
2. Add a timer inside your last loop:

for i in range(0, numPages - 1):
    sleep(5)  # wait 5 seconds on each iteration
    nextButton = driver.find_element_by_link_text('Next')
    nextButton.click()
    infoAll.extend(scrape())
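
A fixed delay is also easy to fingerprint, so you can randomize it if you want. A minimal sketch using only the standard library (the 2–6 second range is an arbitrary choice, not something Google documents):

from random import uniform
from time import sleep

for i in range(0, numPages - 1):
    # sleep a random 2-6 seconds so the request pattern is less regular
    sleep(uniform(2, 6))
    nextButton = driver.find_element_by_link_text('Next')
    nextButton.click()
    infoAll.extend(scrape())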
Answered By: Alberto Castillo

Google Search can be parsed with the BeautifulSoup web scraping library without selenium, since the data is not loaded dynamically via JavaScript. It will also execute much faster than selenium, because there’s no need to render the page in a browser.

In order to get information from all pages, you can paginate with an infinite while loop. Try to avoid for i in range() pagination, as it hardcodes the number of pages and is therefore unreliable: if the number of pages changes (say, from 5 to 20), the pagination breaks.

Since the while loop is infinite, you need to set conditions for exiting it. You can use two:

  • the exit condition is the presence of the button that switches to the next page (it is absent on the last page); the presence can be checked by its CSS selector (in our case ".d6cvqb a[id=pnnext]"):
# condition for exiting the loop in the absence of the next page button
    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10
    else:
        break
  • another option is to add a limit on the number of pages to scrape, if there is no need to extract all of them:
# condition for exiting the loop when the page limit is reached
    if page_num == page_limit:
        break

When you request a site, it may decide that you are a bot. To prevent this, send headers that contain a user-agent with the request; the site will then assume you are a user and display the information.

The next step could be to rotate the user-agent, for example switching between PC, mobile, and tablet, as well as between browsers, e.g. Chrome, Firefox, Safari, Edge, and so on. The most reliable approach is to combine rotating proxies, user-agents, and a CAPTCHA solver.

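As a sketch of the user-agent rotation idea, you can keep a small pool of user-agent strings and pick one at random per request (the strings below are illustrative examples, not a vetted list):

import random
import requests

# illustrative pool of user-agent strings: Windows desktop, macOS, and mobile
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148",
]

# choose a fresh user-agent for each request
headers = {"User-Agent": random.choice(user_agents)}
html = requests.get("https://www.google.com/search",
                    params={"q": "cars"}, headers=headers, timeout=30)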

from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "cars",         # query example
    "hl": "en",          # language
    "gl": "uk",          # country of the search, UK -> United Kingdom
    "start": 0,          # number page by default up to 0
    #"num": 100          # parameter defines the maximum number of results to return.
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

page_limit = 10                # page limit for example

page_num = 0

data = []

while True:
    page_num += 1
    print(f"page: {page_num}")
        
    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')
    
    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        try:
            snippet = result.select_one(".lEBKkf span").text
        except AttributeError:
            # select_one() returned None: this result has no snippet
            snippet = None
        links = result.select_one(".yuRUbf a")["href"]

        data.append({
            "title": title,
            "snippet": snippet,
            "links": links
        })
    # condition for exiting the loop when the page limit is reached
    if page_num == page_limit:
        break
    # condition for exiting the loop in the absence of the next page button
    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10
    else:
        break
print(json.dumps(data, indent=2, ensure_ascii=False))

Example output:

[
  {
    "title": "Cars (2006) - IMDb",
    "snippet": "On the way to the biggest race of his life, a hotshot rookie race car gets stranded in a rundown town, and learns that winning isn't everything in life.",
    "links": "https://www.imdb.com/title/tt0317219/"
  },
  {
    "title": "Cars (film) - Wikipedia",
    "snippet": "Cars is a 2006 American computer-animated sports comedy film produced by Pixar Animation Studios and released by Walt Disney Pictures. The film was directed ...",
    "links": "https://en.wikipedia.org/wiki/Cars_(film)"
  },
  {
    "title": "Cars - Rotten Tomatoes",
    "snippet": "Cars offers visual treats that more than compensate for its somewhat thinly written story, adding up to a satisfying diversion for younger viewers.",
    "links": "https://www.rottentomatoes.com/m/cars"
  },
  other results ...
]

Also you can use the Google Search Engine Results API from SerpApi. It’s a paid API with a free plan.
The difference is that it bypasses blocks (including CAPTCHA) from Google, so there’s no need to create and maintain the parser yourself.

Code example:

from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os

params = {
  "api_key": "...",                  # serpapi key from https://serpapi.com/manage-api-key
  "engine": "google",                # serpapi parser engine
  "q": "cars",                       # search query
  "gl": "uk",                        # country of the search, UK -> United Kingdom
  "num": "100"                       # number of results per page (100 per page in this case)
  # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)      # where data extraction happens

page_limit = 10
organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()    # JSON -> Python dictionary
    
    page_num += 1
    
    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet"),
            "link": result.get("link")
        })

    if page_num == page_limit:
        break
      
    if "next_link" in results.get("serpapi_pagination", []):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break
    
print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))

Output:

[
  {
    "title": "Rally Cars - Page 30 - Google Books result",
    "snippet": "Some people say rally car drivers are the most skilled racers in the world. Roger Clark, a British rally legend of the 1970s, describes sliding his car down ...",
    "link": "https://books.google.co.uk/books?id=uIOlAgAAQBAJ&pg=PA30&lpg=PA30&dq=cars&source=bl&ots=9vDWFi0bHD&sig=ACfU3U1d4R-ShepjsTtWN-b9SDYkW1sTDQ&hl=en&sa=X&ved=2ahUKEwjPv9axu_b8AhX9LFkFHbBaB8c4yAEQ6AF6BAgcEAM"
  },
  {
    "title": "Independent Sports Cars - Page 5 - Google Books result",
    "snippet": "The big three American auto makers produced sports and sports-like cars beginning with GMs Corvette and Fords Thunderbird in 1954. Folowed by the Mustang, ...",
    "link": "https://books.google.co.uk/books?id=HolUDwAAQBAJ&pg=PA5&lpg=PA5&dq=cars&source=bl&ots=yDaDtQSyW1&sig=ACfU3U11nHeRTwLFORGMHHzWjaVHnbLK3Q&hl=en&sa=X&ved=2ahUKEwjPv9axu_b8AhX9LFkFHbBaB8c4yAEQ6AF6BAgaEAM"
  }
  other results...
]
Answered By: Denis Skopa