Having trouble finding the text of Google search results

Question:

I’ve been trying to use BeautifulSoup to find the text of each search result on Google. Using the developer tools, I can see that each result is represented by an <h3> with the class "LC20lb DKV0Md".

However, I can't seem to find it using BeautifulSoup. What am I doing wrong?

import requests
from bs4 import BeautifulSoup

res = requests.get('http://google.com/search?q=world+news')
soup = BeautifulSoup(res.content, 'html.parser')
soup.find_all('h3', class_= 'LC201b DKV0Md')
Asked By: coldlightning14


Answers:

You do not have to search by class; you can simply select every <h3> that contains a <div> and then call get_text() on each:

import requests
from bs4 import BeautifulSoup

res = requests.get('http://google.com/search?q=world+news')
soup = BeautifulSoup(res.content, 'html.parser')

[x.get_text() for x in soup.select('h3 div')]

Output:

['World - BBC News',
 'BBC News World',
 'Latest news from around the world | The Guardian',
 'World - breaking news, videos and headlines - CNN',
 'CNN International - Breaking News, US News, World News and Video',
 'Welt-Nachrichten',
 'BBC World News (Fernsehsender)',
 'World News - Breaking international news and headlines | Sky News',
 'International News | Latest World News, Videos & Photos -ABC',
 'World News Headlines | Reuters',
 'World News - Hindustan Times',
 'World News | International Headlines - Breaking World - Global News']
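As a side note, the find_all() in the question appears to fail because of a typo: it searches for 'LC201b' (with the digit 1) while the page uses 'LC20lb' (with a lowercase l). Also worth knowing: BeautifulSoup treats class as a multi-valued attribute, so matching on a single class is enough even when the element carries several. A minimal, self-contained sketch (using a small inline HTML snippet rather than a live Google page):

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for one Google result; the real page has many such <h3> tags.
html = '<h3 class="LC20lb DKV0Md"><div>World - BBC News</div></h3>'
soup = BeautifulSoup(html, 'html.parser')

# class_='LC20lb' matches because BeautifulSoup checks each class individually,
# not the full "LC20lb DKV0Md" string.
titles = [h.get_text() for h in soup.find_all('h3', class_='LC20lb')]
print(titles)  # ['World - BBC News']
```

Keep in mind that these class names are generated and change over time, so any selector based on them is fragile.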
Answered By: HedgeHog

If you're having difficulty figuring out which element to use, have a look at the SelectorGadget Chrome extension; it's easier and faster than digging through dev tools. However, it does not always work if the website is rendered via JavaScript.

"find the text of each search result on google."

Assuming that you meant "from all pages", you can scrape Google search results from every page using a while True loop, i.e. pagination.

The while loop keeps paginating to the next page as long as the .d6cvqb a[id=pnnext] selector, which corresponds to the "next page" button, is present:

if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10  # increment the URL parameter that controls the page number
else:
    break

Keep in mind that the request may be blocked if you use requests, because the library's default user-agent is python-requests.

To avoid that, one option is to rotate the user-agent, for example switching between PC, mobile, and tablet strings, as well as between browsers, e.g. Chrome, Firefox, Safari, Edge, and so on.
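Rotation can be as simple as picking a random string per request. A minimal sketch; the user-agent strings below are illustrative examples only, and real rotation would use a larger, maintained pool (or a library such as fake-useragent):

```python
import random

# Hypothetical pool for illustration — swap in current, real-world strings.
USER_AGENTS = [
    # desktop Chrome
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    # desktop Safari
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.5 Safari/605.1.15",
    # mobile Safari
    "Mozilla/5.0 (iPhone; CPU iPhone OS 15_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148",
]

def random_headers():
    """Return a headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

# Usage: requests.get(url, params=params, headers=random_headers(), timeout=30)
```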

Check code in online IDE.

from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "world news",   # query (requests URL-encodes params, so no "+" needed)
    "hl": "en",          # language
    "gl": "us",          # country of the search, US -> USA
    "start": 0,          # page offset; 0 = first page
    #"num": 100          # parameter defines the maximum number of results to return.
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

page_num = 0

website_data = []

while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        website_link = result.select_one(".yuRUbf a")["href"]

        website_data.append({
            "title": title,
            "website_link": website_link
        })

    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break

print(json.dumps(website_data, indent=2, ensure_ascii=False))

Example output:

[
  {
    "title": "World news – breaking news, videos and headlines - CNN",
    "website_link": "https://www.cnn.com/world"
  },
  {
    "title": "World - BBC News",
    "website_link": "https://www.bbc.com/news/world"
  },
  {
    "title": "Latest news from around the world | The Guardian",
    "website_link": "https://www.theguardian.com/world"
  },
  {
    "title": "World News Headlines | Reuters",
    "website_link": "https://www.reuters.com/news/archive/worldNews"
  },
  {
    "title": "World - NBC News",
    "website_link": "https://www.nbcnews.com/world"
  },
  # ...
]

Alternatively, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference is that it bypasses blocks (including CAPTCHA) from Google, so there is no need to build and maintain a parser yourself.

Code example:

from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os

params = {
  "api_key": os.getenv("API_KEY"), # serpapi key
  "engine": "google",              # serpapi parser engine
  "q": "world news",               # search query
  "num": "100"                     # number of results per page (100 per page in this case)
  # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)      # where data extraction happens

organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()    # JSON -> Python dictionary
    
    page_num += 1
    
    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title") 
        })
    
    if "next_link" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break
    
print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))

Output:

[
  {
    "title": "World news – breaking news, videos and headlines - CNN"
  },
  {
    "title": "World - BBC News"
  },
  {
    "title": "Latest news from around the world | The Guardian"
  },
  {
    "title": "World News Headlines | Reuters"
  },
  {
    "title": "World - NBC News"
  },
  # ...
]
Answered By: Denis Skopa