How to retrieve Google vocabulary with BeautifulSoup in Python

Question:

I would like to retrieve the result shown in the image:

Google result from the vocabulary

When I print the soup, I cannot find the element that I can see when inspecting the page in the browser:

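import requests
from bs4 import BeautifulSoup
q = "geenna"  # placeholder search term; the original snippet does not define q
response = requests.get(f"https://www.google.com/search?q={q}+definition")
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)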

How can I retrieve the results that Google shows? And, for the sake of knowledge, what are they?

What I can get are just the two lines extracted from the first website Google found relevant, which would otherwise have been enough.

Asked By: Neuran

Answers:

You can’t find the element you are looking for because the URL you request doesn’t return it, in Python or in the browser.

Open your browser (Firefox or Chrome) and press F12 to open the developer tools. Go to the Network tab, then visit or refresh the page.

The first request should be the same as the one your Python code makes. If you open its Response tab, you will see that the response matches what you get in Python.

What is happening is that Google loads the initial request, then makes many more requests that bring back the information and update the page.
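
A quick way to confirm this from Python, without opening the browser, is to check whether a phrase you can see in the rendered page appears in the visible text of the raw response. The sketch below uses only the standard library; the helper names (visible_text, phrase_is_visible) and the sample HTML are made up for illustration, and in practice you would pass response.text instead of the sample.

```python
from html.parser import HTMLParser


class _VisibleTextParser(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the text a reader could see in the raw HTML, joined by spaces."""
    parser = _VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def phrase_is_visible(html, phrase):
    """Case-insensitive check for a phrase in the visible text."""
    return phrase.lower() in visible_text(html).lower()


# Hypothetical response body: the definition text exists only inside a script.
sample_html = (
    "<html><body><p>Search results</p>"
    "<script>var later = 'sostantivo femminile';</script>"
    "</body></html>"
)
print(phrase_is_visible(sample_html, "Search results"))        # True
print(phrase_is_visible(sample_html, "sostantivo femminile"))  # False
```

If a phrase is missing here but present in the browser, it was most likely added to the page after the initial response.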

To get the information you want, you can try using the Google API, using the selenium library, replicating the actual request that fetches the definition (you can find it in the Network tab in your browser), or using a different site’s dictionary.

Answered By: Rusty Robot

You can use CSS selectors to find all the information you need (syllables, phonetic spelling, etc.). They are easy to identify with the SelectorGadget Chrome extension: click the desired element in your browser and it suggests a selector (this does not always work well if the website is rendered via JavaScript).

Make sure you send a user-agent request header so the request looks like a visit from a "real" user. The default requests user-agent is python-requests, so websites can tell that the request most likely comes from a script. Check what your user-agent is.

This is most likely why the results differ from what you see in the browser: the HTML returned for such a request is a different page, often some sort of error page, so it contains different HTML elements.
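
To see the difference in the user-agent header without sending anything to Google, you can prepare the request and inspect its headers. This is only a sketch: it builds two requests with requests.Session.prepare_request and never sends them. The names BROWSER_UA, default_request, and custom_request are chosen for this example.

```python
import requests

# Browser-like user-agent string, the same one used in the code further down.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
)
URL = "https://www.google.com/search"

session = requests.Session()

# Prepared with the session defaults only: no custom headers.
default_request = session.prepare_request(requests.Request("GET", URL))

# Prepared with an explicit user-agent, which overrides the default.
custom_request = session.prepare_request(
    requests.Request("GET", URL, headers={"User-Agent": BROWSER_UA})
)

# Nothing is sent over the network; we only look at the prepared headers.
print(default_request.headers["User-Agent"])  # starts with "python-requests/"
print(custom_request.headers["User-Agent"])
```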

Check code in online IDE.

from bs4 import BeautifulSoup
import json
import requests
# the "lxml" parser used below requires the lxml package to be installed

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
}

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls

# these URL params are taken from the actual Google search URL
# and rewritten in a more readable format
params = {
  "q": "geenna definizione",           # query
  "gl": "it",                          # country to search from
  "hl": "it",                          # language
}

data = []

html = requests.get("https://www.google.com/search", headers=headers, params=params, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

syllables = soup.select_one(".REww7c span").text
phonetic = soup.select_one(".qexShd .LTKOO").text
word_type = soup.select_one(".pgRvse").text
definitions = soup.select_one(".sY7ric span").text

data.append({
  "syllables": syllables,
  "phonetic": phonetic,
  "word_type": word_type,
  "definitions": definitions
})
print(json.dumps(data, indent=2, ensure_ascii=False))

Example output

[
  {
    "syllables": "Geenna",
    "phonetic": "/Ge·èn·na/",
    "word_type": "sostantivo femminile",
    "definitions": "La valle di Ennom a sud-ovest di Gerusalemme, maledetta dal re Giosia (perché sede del culto di Moloch, cui venivano offerti sacrifici umani) e destinata a immondezzaio della città; poiché vi ardeva continuamente il fuoco, nel Vangelo è presa a simbolo dell'Inferno."
  }
]
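
Class names such as .REww7c are generated by Google and may change, and select_one() returns None when nothing matches, in which case calling .text on the result raises AttributeError. A minimal sketch of a more defensive lookup follows; the helper first_text and the sample HTML are assumptions made for this example, and it uses the built-in "html.parser" backend so it runs without lxml.

```python
from bs4 import BeautifulSoup


def first_text(soup, selector):
    """Return the stripped text of the first match for selector, or None."""
    element = soup.select_one(selector)
    return element.get_text(strip=True) if element is not None else None


# Hypothetical page fragment: only the word type is present.
sample_html = '<div class="pgRvse">sostantivo femminile</div>'
soup = BeautifulSoup(sample_html, "html.parser")

word_type = first_text(soup, ".pgRvse")
phonetic = first_text(soup, ".qexShd .LTKOO")  # selector does not match here

print(word_type)  # sostantivo femminile
print(phonetic)   # None
```

A None value is then a hint that the selector is outdated or that Google returned a different page than expected.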

As an alternative, you can use the Google Direct Answer Box API from SerpApi. It’s a paid API with a free plan.
The difference is that it bypasses blocks from Google (including CAPTCHA), so there is no need to create a parser and maintain it.

Code example:

from serpapi import GoogleSearch
import os, json

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    "api_key": os.getenv("API_KEY"),  # serpapi api key from https://serpapi.com/manage-api-key
    "engine": "google",               # search engine
    "q": "geenna definizione",        # search query
    "gl": "it",                       # country to search from
    "hl": "it",                       # language
}

search = GoogleSearch(params)
results = search.get_dict()

data = results['answer_box']
   
print(json.dumps(data, indent=2, ensure_ascii=False))

Output:

[
  {
  "type": "dictionary_results",
  "syllables": "Geenna",
  "phonetic": "/Ge·èn·na/",
  "word_type": "sostantivo femminile",
  "definitions": [
    "La valle di Ennom a sud-ovest di Gerusalemme, maledetta dal re Giosia (perché sede del culto di Moloch, cui venivano offerti sacrifici umani) e destinata a immondezzaio della città; poiché vi ardeva continuamente il fuoco, nel Vangelo è presa a simbolo dell'Inferno."
  ]
  }
]
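
Not every query produces an answer box, so results['answer_box'] can raise KeyError. The sketch below reads the box defensively. It assumes results is the dict returned by search.get_dict() and that the box has the fields shown in the output above; the helper name summarize_dictionary_box and the shortened sample data are made up for illustration.

```python
def summarize_dictionary_box(results):
    """Return a small summary dict, or None when no dictionary box is present."""
    box = results.get("answer_box")
    # The output above is printed as a one-element list, so accept both shapes.
    if isinstance(box, list):
        box = box[0] if box else None
    if not isinstance(box, dict) or box.get("type") != "dictionary_results":
        return None
    return {
        "word": box.get("syllables"),
        "word_type": box.get("word_type"),
        "definitions": list(box.get("definitions", [])),
    }


# Hypothetical, shortened sample shaped like the output above.
sample_results = {
    "answer_box": {
        "type": "dictionary_results",
        "syllables": "Geenna",
        "phonetic": "/Ge·èn·na/",
        "word_type": "sostantivo femminile",
        "definitions": ["La valle di Ennom a sud-ovest di Gerusalemme."],
    }
}

print(summarize_dictionary_box(sample_results))
print(summarize_dictionary_box({}))  # None: no answer box in the results
```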

There’s a blog post, "Scrape Multiple Google Answer Box Layouts with Python", if you need to scrape more data from the Google Answer Box.

Disclaimer: I work for SerpApi.

Answered By: Denis Skopa