Google Search Web Scraping with Python
Question:
I’ve been learning a lot of Python lately to work on some projects at work.
Currently I need to do some web scraping of Google search results. I found several sites that demonstrated how to use the Google AJAX API to search, but after attempting to use it, it appears to no longer be supported. Any suggestions?
I’ve been searching for quite a while but can’t seem to find any solutions that currently work.
Answers:
You can always scrape Google results directly. To do this, request the URL https://google.com/search?q=<Query>, which returns the top 10 search results.
Then you can use lxml, for example, to parse the page. Depending on what you use, you can query the resulting node tree either via a CSS selector (.r a) or via an XPath selector (//h3[@class="r"]/a). These selectors depend on Google's markup, which changes over time, so they may need adjusting.
In some cases the resulting URL will redirect through Google. Usually it contains a query parameter q, which holds the actual target URL.
Example code using lxml and requests:
from urllib.parse import urlparse, parse_qs
from lxml.html import fromstring
from requests import get
raw = get("https://www.google.com/search?q=StackOverflow").text
page = fromstring(raw)
for result in page.cssselect(".r a"):
    url = result.get("href")
    # Redirect links look like /url?q=<target>&...; unwrap the real target.
    if url.startswith("/url?"):
        url = parse_qs(urlparse(url).query)['q'][0]
    print(url)
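The redirect handling described in this answer can be isolated into a small helper. This is a minimal sketch using only the standard library; the function name `unwrap_google_redirect` is hypothetical, and it assumes redirect links have the form `/url?q=<target>&...` as described above.

```python
from urllib.parse import urlparse, parse_qs


def unwrap_google_redirect(href):
    """Return the target URL if href is a Google redirect link, else href unchanged.

    Assumes redirect links look like "/url?q=<target>&...".
    """
    if href.startswith("/url?"):
        targets = parse_qs(urlparse(href).query).get("q")
        if targets:
            return targets[0]
    return href


print(unwrap_google_redirect("/url?q=https://example.com/page&sa=U"))
print(unwrap_google_redirect("https://example.com/direct"))
```

Links that are not redirects, or redirects without a q parameter, are returned unchanged, so the helper can be applied to every href without extra checks.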
A note on Google banning your IP: in my experience, Google only bans you
if you start spamming it with search requests. It will respond
with a 503 if it thinks you are a bot.
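One way to react to a 503 is to wait and retry with increasing delays rather than sending more requests immediately. The following is a sketch under stated assumptions, not a guaranteed way to avoid a ban: the helper name `fetch_with_backoff` and its parameters are hypothetical, and `fetch` can be any callable that returns an object with a `status_code` attribute, such as `requests.get`.

```python
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until the status is not 503 or the attempts run out.

    The delay doubles after each 503 (base_delay, 2 * base_delay, ...).
    The last response is returned even if it is still a 503.
    """
    response = None
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code != 503:
            return response
        # Do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return response
```

With requests installed, it could be called as `fetch_with_backoff(requests.get, "https://www.google.com/search?q=StackOverflow")`. The `sleep` parameter is injectable so the function can be exercised without actually waiting.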
You can also use a third-party service like Serp API (I wrote and run this tool), which is a paid Google search engine results API. It solves the problem of being blocked, and you don’t have to rent proxies or parse the results yourself.
It’s easy to integrate with Python:
from lib.google_search_results import GoogleSearchResults
params = {
    "q": "Coffee",
    "location": "Austin, Texas, United States",
    "hl": "en",
    "gl": "us",
    "google_domain": "google.com",
    "api_key": "demo",
}
query = GoogleSearchResults(params)
dictionary_results = query.get_dictionary()
GitHub: https://github.com/serpapi/google-search-results-python
Here is another service that can be used for scraping SERPs: https://zenserp.com. It does not require a client library and is cheaper.
Here is a Python code sample:
import requests
headers = {
    'apikey': '',
}
params = (
    ('q', 'Pied Piper'),
    ('location', 'United States'),
    ('search_engine', 'google.com'),
    ('language', 'English'),
)
response = requests.get('https://app.zenserp.com/api/search', headers=headers, params=params)
You have two options: build it yourself or use a SERP API.
A SERP API will return the Google search results as a formatted JSON response.
I would recommend a SERP API as it is easier to use, and you don’t have to worry about getting blocked by Google.
1. SERP API
I have had good experiences with the ScraperBox SERP API.
You can use the following code to call the API. Make sure to replace YOUR_API_TOKEN with your ScraperBox API token.
import urllib.parse
import urllib.request
import ssl
import json
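# Note: this disables TLS certificate verification for all HTTPS requests.
# Remove it unless you have a specific reason to need it.
ssl._create_default_https_context = ssl._create_unverified_context
# URL-encode the query string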
q = urllib.parse.quote_plus("Where can I get the best coffee")
# Create the query URL.
query = "https://api.scraperbox.com/google"
query += "?token=%s" % "YOUR_API_TOKEN"
query += "&q=%s" % q
query += "&proxy_location=gb"
# Call the API.
request = urllib.request.Request(query)
raw_response = urllib.request.urlopen(request).read()
raw_json = raw_response.decode("utf-8")
response = json.loads(raw_json)
# Print the first result title
print(response["organic_results"][0]["title"])
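Indexing `response["organic_results"][0]` raises an error when a query returns no results. A more defensive way to read the titles might look like the sketch below. The helper name `result_titles` and the sample payload are hypothetical; only the `organic_results` and `title` keys are taken from the code above.

```python
import json


def result_titles(response):
    """Return the titles of all organic results, skipping entries without one."""
    results = response.get("organic_results") or []
    return [r["title"] for r in results if isinstance(r, dict) and "title" in r]


# Hypothetical sample payload shaped like the response used above.
sample = json.loads(
    '{"organic_results": [{"title": "Best coffee in town"}, {"title": "Coffee guide"}, {}]}'
)
print(result_titles(sample))
```

An empty or missing `organic_results` list yields an empty list instead of an exception.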
2. Build your own Python scraper
I recently wrote an in-depth blog post on how to scrape search results with Python.
Here is a quick summary.
First you should get the HTML contents of the Google search result page.
import urllib.request
url = 'https://google.com/search?q=Where+can+I+get+the+best+coffee'
# Perform the request
request = urllib.request.Request(url)
# Set a normal User Agent header, otherwise Google will block the request.
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')
raw_response = urllib.request.urlopen(request).read()
# Read the response as a utf-8 string
html = raw_response.decode("utf-8")
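The URL above has the query encoded by hand. For arbitrary search terms it is safer to let the standard library do the encoding. A minimal sketch follows; the function name `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(query):
    """Build a Google search URL with the query safely URL-encoded."""
    return "https://google.com/search?" + urlencode({"q": query})


print(build_search_url("Where can I get the best coffee"))
```

For the query used in this answer it produces the same URL as the hand-written one, and it also escapes characters such as `&` and `+` that would otherwise corrupt the query string.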
Then you can use BeautifulSoup to extract the search results.
For example, the following code will get all titles.
from bs4 import BeautifulSoup
# The code to get the html contents here.
soup = BeautifulSoup(html, 'html.parser')
# Find all the search result divs
divs = soup.select("#search div.g")
for div in divs:
    # Search for a h3 tag
    results = div.select("h3")
    # Check if we have found a result
    if results:
        # Print the title
        h3 = results[0]
        print(h3.get_text())
You can extend this code to also extract the search result urls and descriptions.
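As an illustration of extracting URLs alongside titles, here is a dependency-free sketch that uses the standard library's `html.parser` instead of BeautifulSoup. The class name `TitleLinkParser` and the sample markup are hypothetical; it assumes each title `<h3>` sits inside the result's `<a href>`, which may not match the markup Google currently serves.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect (title, url) pairs for every <h3> nested inside an <a href>."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None   # href of the <a> we are currently inside
        self._title = None  # text pieces of the <h3> we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
        elif tag == "h3":
            self._title = []

    def handle_data(self, data):
        if self._title is not None:
            self._title.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._title is not None:
            self.results.append(("".join(self._title).strip(), self._href))
            self._title = None
        elif tag == "a":
            self._href = None


# Hypothetical markup; real result pages are more complex and change often.
sample_html = """
<div id="search">
  <div class="g"><a href="https://example.com/a"><h3>First title</h3></a></div>
  <div class="g"><a href="https://example.com/b"><h3>Second <b>title</b></h3></a></div>
</div>
"""

parser = TitleLinkParser()
parser.feed(sample_html)
for title, url in parser.results:
    print(title, url)
```

The same pairing can be done with BeautifulSoup by selecting the link inside each result div; the parser class here only avoids the extra dependency.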
The current answers will work, but Google will ban you for scraping.
My current solution uses requests_ip_rotator:
import requests
from urllib.parse import quote_plus
from requests_ip_rotator import ApiGateway
keywords = ['test']
def parse(keyword, session):
    # URL-encode the keyword so spaces and special characters are safe
    url = f"https://www.google.com/search?q={quote_plus(keyword)}"
    response = session.get(url)
    print(response)
if __name__ == '__main__':
    AWS_ACCESS_KEY_ID = ''
    AWS_SECRET_ACCESS_KEY = ''
    gateway = ApiGateway("https://www.google.com", access_key_id=AWS_ACCESS_KEY_ID,
                         access_key_secret=AWS_SECRET_ACCESS_KEY)
    gateway.start()
    session = requests.Session()
    # Route all requests to this host through the rotating gateway
    session.mount("https://www.google.com", gateway)
    for keyword in keywords:
        parse(keyword, session)
    gateway.shutdown()
You can create AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in the AWS console.
This solution allows you to make up to 1 million requests (the Amazon free-tier limit).