How to scrape related searches on Google?

Question:

I’m trying to scrape Google for related searches given a list of keywords, and then output these related searches to a CSV file. My problem is getting Beautiful Soup to identify the related-searches HTML tags.

Here is an example html tag in the source code:

<div data-ved="2ahUKEwitr8CPkLT3AhVRVsAKHVF-C80QmoICKAV6BAgEEBE">iphone xr</div>

Here are my webdriver settings:

import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

user_agent = 'Chrome/100.0.4896.60'

webdriver_options = webdriver.ChromeOptions()
webdriver_options.add_argument('user-agent={0}'.format(user_agent))


capabilities = webdriver_options.to_capabilities()
capabilities["acceptSslCerts"] = True
capabilities["acceptInsecureCerts"] = True

Here is my code as is:

queries = ["iphone"]

driver = webdriver.Chrome(options=webdriver_options, desired_capabilities=capabilities, port=4444)

df2 = []

driver.get("https://google.com")
time.sleep(3)
driver.find_element(By.CSS_SELECTOR, "[aria-label='Agree to the use of cookies and other data for the purposes described']").click()

# get_current_related_searches
for query in queries:
    driver.get("https://google.com/search?q=" + query)
    time.sleep(3)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    p = soup.find_all('div data-ved')
    print(p)
    d = pd.DataFrame({'loop': 1, 'source': query, 'from': query, 'to': [s.text for s in p]})
    terms = d["to"]
    df2.append(d)
    time.sleep(3)

df = pd.concat(df2).reset_index(drop=False)

df.to_csv("related_searches.csv")

It’s the p = soup.find_all(...) call that’s incorrect; I’m just not sure how to get Beautiful Soup to identify these specific HTML tags. Any help would be great 🙂

Asked By: JakeCohenSol


Answers:

@jakecohensol, as you’ve pointed out, the selector in p = soup.find_all is wrong. The correct CSS selector is .y6Uyqe .AB4Wff.

The Chrome/100.0.4896.60 User-Agent header is incomplete. Google blocks requests with such an agent string; with a full User-Agent string, Google returns a proper HTML response.
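
If you want to keep the Selenium approach, here’s a minimal sketch that applies both fixes (a full User-Agent string and the selector above) to a single query. Treat the class names as an assumption to verify in your own browser, since Google changes them periodically:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

# Full User-Agent string instead of just "Chrome/100.0.4896.60"
user_agent = "Mozilla/5.0 (X11; CrOS x86_64 14526.89.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.133 Safari/537.36"

webdriver_options = webdriver.ChromeOptions()
webdriver_options.add_argument(f"user-agent={user_agent}")

driver = webdriver.Chrome(options=webdriver_options)
driver.get("https://google.com/search?q=iphone")
time.sleep(3)

soup = BeautifulSoup(driver.page_source, "html.parser")
# Related search terms sit in div.AB4Wff elements inside the .y6Uyqe block
related = [s.text for s in soup.select(".y6Uyqe .AB4Wff")]
print(related)

driver.quit()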

Google Related Searches can be scraped without a browser. It will be faster and more reliable.

Here’s your fixed code snippet (link to the full code in the online IDE):

import time
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 14526.89.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.133 Safari/537.36"
}

queries = ["iphone", "pixel", "samsung"]

df2 = []

# get_current_related_searches
for query in queries:
    params = {"q": query}
    response = requests.get("https://google.com/search", params=params, headers=headers)

    soup = BeautifulSoup(response.text, "html.parser")

    p = soup.select(".y6Uyqe .AB4Wff")

    d = pd.DataFrame(
        {"loop": 1, "source": query, "from": query, "to": [s.text for s in p]}
    )

    terms = d["to"]
    df2.append(d)

    time.sleep(3)

df = pd.concat(df2).reset_index(drop=False)

df.to_csv("related_searches.csv")

Sample output:

,index,loop,source,from,to
0,0,1,iphone,iphone,iphone 13
1,1,1,iphone,iphone,iphone 12
2,2,1,iphone,iphone,iphone x
3,3,1,iphone,iphone,iphone 8
4,4,1,iphone,iphone,iphone 7
5,5,1,iphone,iphone,iphone xr
6,6,1,iphone,iphone,find my iphone
7,0,1,pixel,pixel,pixel 6
8,1,1,pixel,pixel,google pixel
9,2,1,pixel,pixel,pixel phone
10,3,1,pixel,pixel,pixel 6 pro
11,4,1,pixel,pixel,pixel 3
12,5,1,pixel,pixel,google pixel price
13,6,1,pixel,pixel,pixel 6 release date
14,0,1,samsung,samsung,samsung galaxy
15,1,1,samsung,samsung,samsung tv
16,2,1,samsung,samsung,samsung tablet
17,3,1,samsung,samsung,samsung account
18,4,1,samsung,samsung,samsung mobile
19,5,1,samsung,samsung,samsung store
20,6,1,samsung,samsung,samsung a21s
21,7,1,samsung,samsung,samsung login
Answered By: Aza Voloshkina

Have a look at the SelectorGadget Chrome extension, which lets you get a CSS selector by clicking on the desired element in your browser.

Check what your user agent is, or collect multiple user agents for mobile, tablet, PC, or different operating systems so you can rotate them, which slightly reduces the chance of being blocked.

The ideal scenario is to combine rotating user agents with rotating proxies (ideally residential) and a CAPTCHA solver for the Google CAPTCHA that will eventually appear; a sketch of the rotation part follows below.
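
For illustration, here is a minimal sketch of rotating user agents and routing requests through a proxy with requests. The User-Agent pool and the proxy endpoint below are placeholders, not working values:

import random
import requests

# Small pool of full User-Agent strings to rotate through (extend with your own list).
user_agents = [
    "Mozilla/5.0 (X11; CrOS x86_64 14526.89.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.133 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
]

# Placeholder proxy endpoint -- replace with your own (ideally residential) proxy.
proxies = {
    "http": "http://username:password@proxy.example.com:8080",
    "https": "http://username:password@proxy.example.com:8080",
}

def fetch_search_page(query):
    headers = {"User-Agent": random.choice(user_agents)}  # pick a different UA per request
    response = requests.get(
        "https://google.com/search",
        params={"q": query},
        headers=headers,
        proxies=proxies,
        timeout=30,
    )
    response.raise_for_status()
    return response.text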

As an alternative, there’s a Google Search Engine Results API to scrape Google search results if you don’t want to figure out how to create and maintain the parser from scratch, or how to bypass blocks from Google (or other search engines).

Example code to integrate:

import os
import pandas as pd
from serpapi import GoogleSearch

queries = [
    'banana',
    'minecraft',
    'apple stock',
    'how to create a apple pie'
]

def serpapi_scrape_related_queries():

    related_searches = []

    for query in queries:
        print(f'extracting related queries from query: {query}')

        params = {
            'api_key': os.getenv('API_KEY'),  # your serpapi api key
            'device': 'desktop',              # device to retrieve results from
            'engine': 'google',               # serpapi parsing engine
            'q': query,                       # search query
            'gl': 'us',                       # country of the search
            'hl': 'en'                        # language of the search
        }

        search = GoogleSearch(params)         # data extraction happens on the SerpApi backend
        results = search.get_dict()           # JSON -> dict

        for result in results['related_searches']:
            related_query = result['query']
            link = result['link']

            related_searches.append({
                'query': related_query,
                'link': link
            })

    pd.DataFrame(data=related_searches).to_csv('serpapi_related_queries.csv', index=False)

serpapi_scrape_related_queries()

Part of the dataframe output:

             query                                               link
0  banana benefits  https://www.google.com/search?gl=us&hl=en&q=Ba...
1  banana republic  https://www.google.com/search?gl=us&hl=en&q=Ba...
2      banana tree  https://www.google.com/search?gl=us&hl=en&q=Ba...
3   banana meaning  https://www.google.com/search?gl=us&hl=en&q=Ba...
4     banana plant  https://www.google.com/search?gl=us&hl=en&q=Ba...
Answered By: Viktoria