Searching in Google with Python

Question:

I want to search for text on Google using a Python script and return the name, description, and URL of each result. I’m currently using this code:

from google import search

ip=raw_input("What would you like to search for? ")

for url in search(ip, stop=20):
    print(url)

This returns only the URLs. How can I also get the name and description for each result?

Asked By: Yarden


Answers:

I assume you are using this library by Mario Vilas, given the stop=20 argument that appears in his code. That library does not return anything but the URLs, so what you want to do is not possible with the library you are currently using.

I would suggest you instead use abenassi/Google-Search-API. Then you can simply do:

from google import google
num_page = 3
search_results = google.search("This is my query", num_page)
for result in search_results:
    # each result object also exposes name and link attributes
    print(result.name)
    print(result.description)
    print(result.link)
Answered By: Jokab

Not exactly what I was looking for, but I found a workable solution for now (I may edit this if I manage to improve it). I combined the Google search I was already doing (which returns only URLs) with the Beautiful Soup package for parsing the resulting HTML pages:

from googlesearch import search
import urllib.request
from bs4 import BeautifulSoup

def google_scrape(url):
    # fetch the page and return its <title> text
    thepage = urllib.request.urlopen(url)
    soup = BeautifulSoup(thepage, "html.parser")
    return soup.title.text

i = 1
query = 'search this'
for url in search(query, stop=10):
    title = google_scrape(url)
    print(str(i) + ". " + title)
    print(url)
    print(" ")
    i += 1

This gives me a list of page titles along with their links.
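If you also want a description for each result, one hedged extension of the same idea is to read each page’s <meta name="description"> tag with Beautiful Soup. This is a minimal sketch, assuming the target pages actually define that tag (many do not, so it falls back to None):

from googlesearch import search
import urllib.request
from bs4 import BeautifulSoup

def google_scrape(url):
    # return (title, meta description) for a page; either may be missing
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, "html.parser")
    title = soup.title.text if soup.title else url
    tag = soup.find("meta", attrs={"name": "description"})
    description = tag["content"] if tag and tag.has_attr("content") else None
    return title, description

for i, url in enumerate(search('search this', stop=10), start=1):
    title, description = google_scrape(url)
    print(f"{i}. {title}")
    print(url)
    print(description)
    print()

Note that this scrapes each page’s own title and meta description rather than the snippet Google shows, which is usually close enough.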

And another possible solution:

from googlesearch import search
import requests

def everything_between(text, start, end):
    # naive helper: return the substring between the first start/end markers
    return text.split(start, 1)[-1].split(end, 1)[0]

query = input("What would you like to search for? ")
for url in search(query, stop=10):
    r = requests.get(url)
    title = everything_between(r.text, '<title>', '</title>')
    print(title)
Answered By: Yarden

I tried most of these, but they either didn’t work for me or gave errors like “search module not found” despite the packages being installed. Using the Selenium web driver with Firefox, Chrome, or PhantomJS does work, but I still felt it was a bit slow in execution time, since it queries the browser first and then returns the search results.

So I decided to use the Google API instead, and it works amazingly quickly and returns results accurately.

Before I share the code, here are a few quick tips to follow:

  1. Register on the Google API console to get a Google API key (the free tier is fine)
  2. Search for Google Custom Search and set up a free custom search engine to get its search engine ID
  3. Add the google-api-python-client package to your Python project
     (this can be done by running pip install google-api-python-client)

That’s it; all you have to do now is run this code:

from googleapiclient.discovery import build

my_api_key = "YOUR API KEY HERE"
my_cse_id = "YOUR CUSTOM SEARCH ENGINE ID HERE"

def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    return res['items']

results = google_search("YOUR SEARCH QUERY HERE", my_api_key, my_cse_id, num=10)

for result in results:
    # each item also carries "title" and "snippet" fields
    print(result["title"])
    print(result["snippet"])
    print(result["link"])
Answered By: Piyush Rumao

You can also use a third-party service like SerpApi, which is a Google search engine results API. It solves the problem of having to rent proxies and parse the HTML results yourself, and its JSON output is particularly rich.

It’s easy to integrate with Python:

from serpapi import GoogleSearch

params = {
    "q" : "Coffee",
    "location" : "Austin, Texas, United States",
    "hl" : "en",
    "gl" : "us",
    "google_domain" : "google.com",
    "api_key" : "demo",
}

query = GoogleSearch(params)
dictionary_results = query.get_dict()
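To get the name, description, and URL of each result out of that dictionary, you can iterate over its organic results. This is a minimal sketch, assuming the response contains the organic_results list with title, snippet, and link keys as in SerpApi’s documented JSON:

for result in dictionary_results.get("organic_results", []):
    # title, snippet, and link map to the name, description, and URL
    print(result.get("title"))
    print(result.get("snippet"))
    print(result.get("link"))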

GitHub: https://github.com/serpapi/google-search-results-python

Answered By: Hartator

Usually, you cannot use the Google search function in Python 3 by importing the google package, but you can use it in Python 2.

Even using requests.get(url+query), the scraping won’t work, because Google prevents scraping by redirecting to a CAPTCHA page.

Possible ways:

  • You can write the code in Python 2.
  • If you want to write it in Python 3, make two files and retrieve the search results from a Python 2 script (see the sketch below).
  • If that proves difficult, the easiest way is to use Google Colab or a Jupyter Notebook with a Python 3 runtime; there you won’t get any error.
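One way to wire the two files together is to call the Python 2 script from Python 3 with subprocess. This is only a sketch; search2.py is a hypothetical Python 2 helper that runs the search and prints the URLs as JSON:

import json
import subprocess

# search2.py is a hypothetical Python 2 script that runs the search and
# prints something like: json.dumps(list(search(query, stop=10)))
output = subprocess.check_output(["python2", "search2.py", "my query"])
urls = json.loads(output)
print(urls)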
Answered By: Strange

You can use the Google Search Origin package, which integrates most of the parameters available on Google (including dorks and filters). It is based on the official Google documentation. Moreover, using it creates an object, so the query is easily modifiable. For more information, look at the project here: https://pypi.org/project/google-search-origin/

Here is an example of how to use it:

import google_search_origin


if __name__ == '__main__':
    # Initialisation of the class
    google_search = google_search_origin.GoogleSearchOrigin(search='sun')
    
    # Request from the url assembled
    google_search.request_url()

    # Display the link parsed depending on the result
    print(google_search.get_all_links())

    # Modify the parameter
    google_search.parameter_search('dog')

    # Assemble the url
    google_search.assemble_url()

    # Request from the url assembled
    google_search.request_url()

    # Display the raw text depending on the result
    print(google_search.get_response_text())
Answered By: Da2ny

As an alternative, if for some reason using the API is not necessary, you can get by with the BeautifulSoup web-scraping library.

If necessary, you can extract data from all pages using a while loop.

The loop will go through every page, no matter how many there are, until a certain condition is fulfilled. In our case, that condition is the presence of a next-page button on the page (the .d6cvqb a[id=pnnext] CSS selector):

# stop the loop on the absence of the next page
if soup.select_one(".d6cvqb a[id=pnnext]"):
    params["start"] += 10
else:
    break

When you make a request, the site may decide that you are a bot. To prevent this, send headers that contain a user-agent with the request; the site will then assume you are a regular user and return the page.

Full code:

from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
query = input("What would you like to search for? ")
params = {
    "q": query,          # query example
    "hl": "en",          # language
    "gl": "uk",          # country of the search, UK -> United Kingdom
    "start": 0,          # number page by default up to 0
    #"num": 100          # parameter defines the maximum number of results to return.
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

page_limit = 10          # page limit if you don't need to fetch everything
page_num = 0

data = []

while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        try:
            snippet = result.select_one(".lEBKkf span").text
        except AttributeError:
            # some results have no snippet
            snippet = None
        links = result.select_one(".yuRUbf a")["href"]

        data.append({
            "title": title,
            "snippet": snippet,
            "links": links
        })

    # stop the loop on the page limit condition
    if page_num == page_limit:
        break
    # stop the loop on the absence of the next page
    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))

Example output:

[
  {
    "title": "Web Scraping with Python - Pluralsight",
    "snippet": "There are times in which you need data but there is no API (application programming interface) to be found. Web scraping is the process of extracting data ...",
    "links": "https://www.pluralsight.com/paths/web-scraping-with-python"
  },
  {
    "title": "Chapter 8 Web Scraping | Machine learning in python",
    "snippet": "Web scraping means extacting data from the “web”. However, web is not just an anonymous internet “out there” but a conglomerat of servers and sites, ...",
    "links": "http://faculty.washington.edu/otoomet/machinelearning-py/web-scraping.html"
  },
  {
    "title": "Web scraping 101",
    "snippet": "This vignette introduces you to the basics of web scraping with rvest. You'll first learn the basics of HTML and how to use CSS selectors to refer to ...",
    "links": "https://cran.r-project.org/web/packages/rvest/vignettes/rvest.html"
  },
  other results ...
]

There’s a “13 ways to scrape any public data from any website” blog post if you want to know more about website scraping.

Answered By: Denis Skopa