Scraping and parsing Google search results using Python

Question:

I previously asked a question about a general approach to crawling and saving webpages.
Part of that original question was: how to crawl and save a lot of “About” pages from the Internet.

After some further research, I have a few options to move forward with, for both scraping and parsing (listed at the bottom).

Today I ran into a Ruby discussion about how to scrape Google search results. This offers a great alternative for my problem, since it would save all the effort on the crawling part.

The new question is: in Python, how do I scrape Google search results for a given keyword, in this case “About”, and finally get the links for further parsing?
What are the best methods and libraries to go with, in terms of being easy to learn and easy to implement?

P.S. On this website, exactly the same thing is implemented, but it is closed and asks for money for more results. I’d prefer to do it myself if nothing open-source is available, and learn more Python in the meanwhile.

Oh, and by the way, advice on parsing the links out of the search results would be nice, if any. Again, easy to learn and easy to implement would be ideal. I just started learning Python. 😛


Final update: problem solved. Here is the code using xgoogle; please read the note in the section below to get xgoogle working.

import time, random
from xgoogle.search import GoogleSearch, SearchError

f = open('a.txt', 'wb')

for i in range(0, 2):
    wt = random.uniform(2, 5)
    gs = GoogleSearch("about")
    gs.results_per_page = 10
    gs.page = i
    try:
        results = gs.get_results()
    except SearchError, e:
        print "Search failed: %s" % e
        break
    # Try not to annoy Google: wait a random short time between requests
    time.sleep(wt)
    print 'This is the %dth iteration and waited %f seconds' % (i, wt)
    for res in results:
        f.write(res.url.encode("utf8"))
        f.write("\n")

print "Done"
f.close()

Note on xgoogle (answered below by Mike Pennington):
The latest version from its GitHub repository no longer works out of the box, probably due to changes in Google's search results. These two replies (a b) on the tool's home page give a solution, and it currently still works with that tweak. But it may stop working again some day due to Google's changes or blocking.


Resources known so far:

  • For scraping, Scrapy seems to be a popular choice (a minimal spider sketch follows this list), a webapp called ScraperWiki is very interesting, and there is another project that extracts its library for offline/local usage. Mechanize was brought up several times in different discussions too.

  • For parsing HTML, BeautifulSoup seems to be one of the most popular choices. Of course, lxml too.
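
For illustration, here is a minimal sketch of what a Scrapy spider that collects links could look like; the start URL is a placeholder, and this is only a sketch of the approach rather than a crawler tuned for the “About” pages task:

import scrapy

class AboutLinksSpider(scrapy.Spider):
    """Minimal sketch: collect every outgoing link from a start page."""
    name = "about_links"
    start_urls = ["https://example.com/"]  # placeholder start page

    def parse(self, response):
        # Yield each link found on the page as an absolute URL
        for href in response.css("a::attr(href)").getall():
            yield {"url": response.urljoin(href)}

Saved as about_links.py, this can be run without a full Scrapy project via scrapy runspider about_links.py -o links.json.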

Asked By: Flake


Answers:

You may find xgoogle useful… much of what you seem to be asking for is there…

Answered By: Mike Pennington

There is the twill library for emulating a browser. I used it when I needed to log in with a Google email account. While it’s a great tool with a great idea, it’s pretty old and seems to lack support nowadays (the latest version was released in 2007).
It might be useful if you want to retrieve results that require cookie handling or authentication; twill is likely one of the best choices for those purposes.
BTW, it’s based on mechanize.
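
For context, twill exposes a command-style API. A rough sketch of a form-based login might look like the following; the URL and field names are placeholders, and the commands shown are from the old twill.commands module:

from twill.commands import go, showforms, fv, submit, show

go("https://example.com/login")         # fetch the login page (placeholder URL)
showforms()                             # list the forms twill found on the page
fv("1", "username", "me@example.com")   # fill the "username" field of form 1 (placeholder field names)
fv("1", "password", "secret")           # fill the "password" field of form 1
submit()                                # submit the form; cookies persist for later requests
show()                                  # print the resulting page source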

As for parsing, you are right: BeautifulSoup and Scrapy are great. One of the cool things about BeautifulSoup is that it can handle invalid HTML (unlike Genshi, for example).
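
To illustrate that point, here is a small self-contained example (the broken markup is made up) of BeautifulSoup building a usable tree from HTML with unclosed tags:

from bs4 import BeautifulSoup

# Deliberately invalid HTML: unclosed <li>, <ul> and <body> tags
broken = "<html><body><ul><li>Home<li><a href='/about'>About us</a>"

soup = BeautifulSoup(broken, "html.parser")
for a in soup.find_all("a"):
    print(a.get("href"), a.get_text())  # prints: /about About us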

Answered By: ikostia

Have a look at this awesome urllib wrapper for web scraping: https://github.com/mattseh/python-web/blob/master/web.py

Answered By: Justin Fay

import re
import urllib.request
from urllib.request import urlopen

from bs4 import BeautifulSoup

query = input("query>> ")
query = "+".join(query.strip().split())

url = "https://www.google.co.in/search?site=&source=hp&q=" + query + "&gws_rd=ssl"
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})

soup = BeautifulSoup(urlopen(req).read(), "html.parser")

# Result hrefs look like "/url?q=<target>&sa=..."; keep everything up to and including "&sa="
reg = re.compile(".*&sa=")

links = []
# Parse the result URLs (relies on Google using <h3 class="r"> for results)
for item in soup.find_all('h3', attrs={'class': 'r'}):
    line = reg.match(item.a['href'][7:]).group()  # strip the leading "/url?q="
    links.append(line[:-4])                       # drop the trailing "&sa="

print(links)

This should be handy. For more, see https://github.com/goyal15rajat/Crawl-google-search.git

Answered By: Rajat Goyal

This one works well at the moment. Given a search, the scraper keeps grabbing titles and their links, traversing all the next pages until there is no next page left or your IP address is banned. Make sure your bs4 version is >= 4.7.0, as I’ve used a CSS pseudo-selector within the script.

from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests

base = "https://www.google.de"
link = "https://www.google.de/search?q={}"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

def grab_content(link):
    res = requests.get(link,headers=headers)
    soup = BeautifulSoup(res.text,"lxml")
    for container in soup.select("[class='g'] a[href^='http'][data-ved]:has(h3)"):
        post_title = container.select_one("h3").get_text(strip=True)
        post_link = container.get('href')
        yield post_title,post_link

    next_page = soup.select_one("a[href][id='pnnext']")
    if next_page:
        next_page_link = urljoin(base,next_page.get("href"))
        yield from grab_content(next_page_link)

if __name__ == '__main__':
    search_keyword = "python"
    qualified_link = link.format(search_keyword.replace(" ","+"))
    for item in grab_content(qualified_link):
        print(item)
Answered By: SIM

Another option for scraping Google search results with Python is ZenSERP.

I like the API-first approach, which is easy to use, and the JSON results are easily integrated into our solution.

Here is an example curl request:

curl "https://app.zenserp.com/api/search" -F "q=Pied Piper" -F "location=United States" -F "search_engine=google.com" -F "language=English" -H "apikey: APIKEY"

And the response:

{
  "q": "Pied Piper",
  "domain": "google.com",
  "location": "United States",
  "language": "English",
  "url": "https://www.google.com/search?q=Pied%20Piper&num=100&hl=en&gl=US&gws_rd=cr&ie=UTF-8&oe=UTF-8&uule=w+CAIQIFISCQs2MuSEtepUEUK33kOSuTsc",
  "total_results": 17100000,
  "auto_correct": "",
  "auto_correct_type": "",
  "results": []
}

And an example in Python:

import requests

headers = {
    'apikey': 'APIKEY',
}

params = (
    ('q', 'Pied Piper'),
    ('location', 'United States'),
    ('search_engine', 'google.com'),
    ('language', 'English'),
)

response = requests.get('https://app.zenserp.com/api/search', headers=headers, params=params)
print(response.json())  # the parsed JSON results
Answered By: Johnny

Here is a Python script using requests and BeautifulSoup to scrape Google results.

import requests
from bs4 import BeautifulSoup

# desktop user-agent
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
# mobile user-agent
MOBILE_USER_AGENT = "Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36"

query = "hackernoon How To Scrape Google With Python"
query = query.replace(' ', '+')
URL = f"https://google.com/search?q={query}"

headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)

if resp.status_code == 200:
    soup = BeautifulSoup(resp.content, "html.parser")
    results = []
    # Each result lives in a <div class="r"> (depends on Google's current markup)
    for g in soup.find_all('div', class_='r'):
        anchors = g.find_all('a')
        if anchors:
            link = anchors[0]['href']
            title = g.find('h3').text
            item = {
                "title": title,
                "link": link
            }
            results.append(item)
    print(results)
Answered By: Learning C

To extract links from multiple pages of Google Search results you can use SerpApi. It’s a paid API with a free trial.

Full example

import os

# Python package: https://pypi.org/project/google-search-results
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "about",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)

pages = search.pagination()

for result in pages:
    print(f"Current page: {result['serpapi_pagination']['current']}\n")

    for organic_result in result["organic_results"]:
        print(
            f"Title: {organic_result['title']}\nLink: {organic_result['link']}\n"
        )

Output

Current page: 12
URL: https://fi.google.com/
URL: https://www.mayoclinic.org/about-mayo-clinic

...

Current page: 18
URL: https://igem.org/About
URL: https://www.ieee.org/
URL: https://www.cancer.org/

...

Disclaimer: I work at SerpApi.

Answered By: ilyazub

This can be done using the google and beautifulsoup4 modules; install them from the command line with the command given below:

pip install google beautifulsoup4

Then run the simplified code given below:

import webbrowser, googlesearch as gs

def direct(txt):
    print(f"sure, searching '{txt}'...")
    # num and stop denote the number of search results you want
    results = gs.search(txt, num=1, stop=1, pause=0)

    for link in results:
        print(link)
        webbrowser.open_new_tab(link)  # open the result in the browser

direct('cheap thrills on Youtube')  # this will play the song on YouTube
                                    # (for this, keep num=1, stop=1)


TIP: Using this, you can also make a small virtual assistant that opens the top search result in the browser for a query (txt) given in natural language; a rough sketch of that idea follows below.
Feel free to comment if you have any difficulty running this code :)
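
Reusing the direct() function defined above, such a loop might look like this (the loop itself is just an assumption about how you might wire it up):

# Minimal "virtual assistant" loop: read a query, open the top result in the browser
while True:
    txt = input("What should I look up? (press Enter to quit) ").strip()
    if not txt:
        break
    direct(txt)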

Answered By: Prashantzz

You can also use Serpdog’s Google Scraper API to extract links from Google Search Results.
It’s a paid API but also offers a free trial.

import requests

def get_data():
    response = requests.get("https://api.serpdog.io/search?api_key=APIKEY&q=about&gl=us")
    organic_results = response.json()["organic_results"]

    for result in organic_results:
        print(result["link"])

get_data()

Results:

https://www.merriam-webster.com/dictionary/about
https://www.dictionary.com/browse/about
https://en.wiktionary.org/wiki/about
https://dictionary.cambridge.org/us/dictionary/english/about
https://www.thesaurus.com/browse/about
....

Disclaimer: I am the founder of serpdog.io.

Answered By: Darshan