extracting href from <a> beautiful soup

Question:

I’m trying to extract a link from a Google search result. Inspect element tells me that the section I am interested in has class="r". The first result looks like this:

<h3 class="r" original_target="https://en.wikipedia.org/wiki/chocolate" style="display: inline-block;">
    <a href="https://en.wikipedia.org/wiki/Chocolate" 
       ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://en.wikipedia.org/wiki/Chocolate&amp;ved=0ahUKEwjW6tTC8LXZAhXDjpQKHSXSClIQFgheMAM" 
       saprocessedanchor="true">
        Chocolate - Wikipedia
    </a>
</h3>

To extract the “href” I do:

import bs4, requests
res = requests.get('https://www.google.com/search?q=chocolate')
googleSoup = bs4.BeautifulSoup(res.text, "html.parser")
elements = googleSoup.select(".r a")
elements[0].get("href")

But I unexpectedly get:

'/url?q=https://en.wikipedia.org/wiki/Chocolate&sa=U&ved=0ahUKEwjHjrmc_7XZAhUME5QKHSOCAW8QFggWMAA&usg=AOvVaw03f1l4EU9fYd'

Where I wanted:

"https://en.wikipedia.org/wiki/Chocolate"

The attribute “ping” seems to be confusing it. Any ideas?

Asked By: GlaceCelery


Answers:

What’s happening?

If you print the response content (i.e. res.text), you’ll see that you’re getting completely different HTML: the page source in your browser and the response content don’t match.

This is not because the content is loaded dynamically; even for dynamically loaded pages, the page source and the response content are the same (it’s the DOM you see while inspecting an element that differs).

The basic explanation is that Google recognizes the request as coming from a Python script and changes its response.
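As a quick sanity check, you can print the User-Agent string that requests sends by default (a small sketch; the exact version number depends on your installed requests):

```python
import requests

# requests identifies itself as "python-requests/<version>" unless
# you override the User-Agent header -- this is what Google detects.
print(requests.utils.default_user_agent())
```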

Solution:

You can pass a fake User-Agent header to make the script's request look like it comes from a real browser.


Code:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

r = requests.get('https://www.google.co.in/search?q=chocolate', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

elements = soup.select('.r a')
print(elements[0]['href'])

Output:

https://en.wikipedia.org/wiki/Chocolate
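Even without spoofing the User-Agent, you can recover the target URL from a redirect-style link by parsing its query string; a sketch using the /url?q=... link from the question:

```python
from urllib.parse import urlparse, parse_qs

# Redirect-style link taken from the question
redirect = '/url?q=https://en.wikipedia.org/wiki/Chocolate&sa=U&ved=0ahUKEwjHjrmc_7XZAhUME5QKHSOCAW8QFggWMAA&usg=AOvVaw03f1l4EU9fYd'

# The real target URL is carried in the "q" query parameter
real_url = parse_qs(urlparse(redirect).query)['q'][0]
print(real_url)  # https://en.wikipedia.org/wiki/Chocolate
```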

Answered By: Keyur Potdar

As the other answer mentioned, it’s because no user-agent was specified. The default requests user-agent is python-requests, so Google blocks the request because it can tell it comes from a bot and not a "real" user visit.

Passing a custom User-Agent in the HTTP request headers makes the request look like a real user visit. You can do this by passing custom headers (check what your user-agent is):

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get("YOUR_URL", headers=headers)

Additionally, to get more accurate results you can pass URL parameters:

params = {
  "q": "samurai cop, what does katana mean",  # query
  "gl": "in",                                 # country to search from
  "hl": "en"                                  # language
  # other parameters 
}
requests.get("YOUR_URL", params=params)
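If you're curious how the params dict ends up in the URL, requests URL-encodes it into the query string; a small sketch previewing the final URL with a prepared request:

```python
import requests

# Build (but don't send) a request to see how params are encoded
req = requests.Request("GET", "https://www.google.com/search",
                       params={"q": "samurai cop", "gl": "in", "hl": "en"})
print(req.prepare().url)  # query is URL-encoded, e.g. q=samurai+cop&gl=in&hl=en
```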

Code and full example in the online IDE (the code from the other answer will throw an error because Google has since changed its CSS selectors):

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "samurai cop what does katana mean",
  "gl": "in",
  "hl": "en"
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
  title = result.select_one('.DKV0Md').text
  link = result.select_one('.yuRUbf a')['href']
  print(f'{title}\n{link}\n')

-------
'''
Samurai Cop - He speaks fluent Japanese - YouTube


Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647

Samurai Cop (1991) - Mathew Karedas as Joe Marshall - IMDb
https://www.imdb.com/title/tt0130236/characters/nm0360481

...
'''

Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It’s a paid API with a free plan.

The difference in your case is that you only need to iterate over structured JSON and quickly get the data you want, rather than figuring out why certain things don’t work as they should and then maintaining the parser over time.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "samurai cop what does katana mean",
    "hl": "en",
    "gl": "in",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
  print(result['title'])
  print(result['link'])
  print()

------
'''
Samurai Cop - He speaks fluent Japanese - YouTube


Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
...
'''

Disclaimer, I work for SerpApi.

Answered By: Dmitriy Zub