Scraping Google News with BeautifulSoup returns empty results

Question:

I am trying to scrape Google News using the following code:

from bs4 import BeautifulSoup
import requests
import time
from random import randint


def scrape_news_summaries(s):
    time.sleep(randint(0, 2))  # relax and don't let google be angry
    r = requests.get("http://www.google.co.uk/search?q="+s+"&tbm=nws")
    content = r.text
    news_summaries = []
    soup = BeautifulSoup(content, "html.parser")
    st_divs = soup.find_all("div", {"class": "st"})
    for st_div in st_divs:
        news_summaries.append(st_div.text)
    return news_summaries


l = scrape_news_summaries("T-Notes")
#l = scrape_news_summaries("""T-Notes""")
for n in l:
    print(n)

This bit of code was working before, but now I can’t figure out why it has stopped working. Is it possible that I’ve been banned by Google, since I only ran the code three or four times? (I tried Bing News too, which unfortunately also returned empty results…)

Thanks.

Asked By: ylnor


Answers:

I tried running the code and it works fine on my computer.

You could try printing the status code for the request and see if it’s anything other than 200.

from bs4 import BeautifulSoup
import requests
import time
from random import randint


def scrape_news_summaries(s):
    time.sleep(randint(0, 2))  # relax and don't let google be angry
    r = requests.get("http://www.google.co.uk/search?q="+s+"&tbm=nws")
    print(r.status_code)  # Print the status code
    content = r.text
    news_summaries = []
    soup = BeautifulSoup(content, "html.parser")
    st_divs = soup.find_all("div", {"class": "st"})
    for st_div in st_divs:
        news_summaries.append(st_div.text)
    return news_summaries


l = scrape_news_summaries("T-Notes")
#l = scrape_news_summaries("""T-Notes""")
for n in l:
    print(n)

See https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/ for a list of status codes that indicate you have been banned.

Answered By: Andreas

Using time.sleep(randint(0, 2)) is not the most reliable way to bypass blocks

There are several steps to bypass blocking:

  1. Make sure you’re sending a user-agent header so the request looks like a "real" user visit. The default requests user-agent is python-requests, and websites can tell that such a request most likely comes from a script. Check what your user-agent is. Setting a user-agent makes requests more reliable (but only up to a certain point).
  2. A single user-agent is not enough; rotating through several of them makes requests a bit more reliable (see the sketch after this list).
  3. Sometimes passing only a user-agent isn’t enough, and you can pass additional headers. See more HTTP request headers that you can send while making a request.
  4. The most reliable way to bypass blocking is residential proxies. Residential proxies let you choose a specific location (country, city, or mobile carrier) and surf the web as a real user in that area. Proxies act as intermediaries between you and general web traffic, buffering requests while also concealing your IP address.
  5. Using non-overused proxies is the best option. You can scrape a lot of public proxies and save them to a list(), or save them to a .txt file to save memory, then iterate over them while making requests to see what the results are, moving on to a different type of proxy if the results are not what you were looking for (a proxy-rotation sketch follows below).
  6. You can get whitelisted. Being whitelisted means having your IP address added to a website’s allow list, which explicitly grants identified entities a particular privilege when everything else is denied by default. One way to become whitelisted is to regularly do something useful for "them" based on the scraped data, which could lead to some insights.
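
For example, the user-agent rotation from items 1–3 can be as simple as choosing a random entry from a list of real-browser strings on each request. A minimal sketch (the user-agent strings below are just examples, not a canonical list):

import requests
from random import choice

# Example real-browser user-agent strings to rotate through
# (any reasonably recent browser strings would do).
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0",
]

# Pick a different user-agent (plus an extra header) for every request.
headers = {
    "User-Agent": choice(user_agents),
    "Accept-Language": "en-US,en;q=0.9",
}
response = requests.get("https://www.google.co.uk/search",
                        params={"q": "T-Notes", "tbm": "nws"},
                        headers=headers, timeout=30)
print(response.status_code)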

For more information on how to bypass blocking, you can read the Reducing the chance of being blocked while web scraping blog post.
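
Likewise, the proxy rotation from item 5 boils down to passing a different proxies dict to requests on each attempt. A minimal sketch, assuming you already have a list of proxies you trust (the addresses below are placeholders from the TEST-NET documentation range, not working proxies):

import requests

# Placeholder proxy addresses -- substitute proxies you have verified yourself.
proxy_pool = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:3128",
]

for proxy in proxy_pool:
    try:
        response = requests.get(
            "https://www.google.co.uk/search",
            params={"q": "T-Notes", "tbm": "nws"},
            proxies={"http": proxy, "https": proxy},  # route the request through this proxy
            timeout=10,
        )
        if response.status_code == 200:
            break  # this proxy works; stop trying the rest
    except requests.exceptions.RequestException:
        continue  # dead or slow proxy -- move on to the next one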

You can also check the response via status_code. If a bad request was made (client error 4XX or server error 5XX), an exception can be raised with Response.raise_for_status(). If the status code is 200 and we call raise_for_status(), it returns None, which means there were no errors and everything is fine.
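
For instance, a minimal sketch of that check:

import requests

response = requests.get("https://www.google.co.uk/search",
                        params={"q": "T-Notes", "tbm": "nws"},
                        headers={"User-Agent": "Mozilla/5.0"},
                        timeout=30)
try:
    response.raise_for_status()  # returns None on success, raises HTTPError on 4XX/5XX
except requests.exceptions.HTTPError as error:
    print(f"Request failed: {error}")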

Code and full example in online IDE:

from bs4 import BeautifulSoup
from random import randint
import requests, time, json, lxml


def scrape_news_summaries(query):
    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        "q": query,
        "hl": "en-US",          # language
        "gl": "US",             # country of the search, US -> USA
        "tbm": "nws",           # google news
    }
    
    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
    }
    
    time.sleep(randint(0, 2))   # relax and don't let google be angry
    
    html = requests.get("http://www.google.co.uk/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")
    
    news_summaries = []
    
    if html.status_code == 200:
        for result in soup.select(".WlydOe"):
            source = result.select_one(".NUnG9d span").text
            title = result.select_one(".mCBkyc").text
            link = result["href"]
            snippet = result.select_one(".GI74Re").text
            date = result.select_one(".ZE0LJd span").text
        
            news_summaries.append({
                "source": source,
                "title": title,
                "link": link,
                "snippet": snippet,
                "date": date
            })
        
    return news_summaries

print(json.dumps(scrape_news_summaries("T-Notes"), indent=2, ensure_ascii=False)) 

Output:

[
  {
    "source": "Barchart.com",
    "title": "U.S. Stocks Undercut As T-Note Yield Rises",
    "link": "https://www.barchart.com/story/news/9572904/u-s-stocks-undercut-as-t-note-yield-rises",
    "snippet": "T-note prices are seeing some supply pressure ahead of today's Treasury nsale of $42 billion of 3-year T-notes. The Treasury will then sell $35...",
    "date": "2 days ago"
  },
  {
    "source": "Barchart.com",
    "title": "U.S. Stocks Rally As T-Note Yield Eases",
    "link": "https://www.barchart.com/story/news/9548700/u-s-stocks-rally-as-t-note-yield-eases",
    "snippet": "Stocks are seeing support from strong overseas stock markets and today's n-5.3 bp decline in the 10-year T-note yield to 2.774%.",
    "date": "2 days ago"
  },
  {
    "source": "PC Gamer",
    "title": "The internet won't stop roasting this new Forspoken trailer",
    "link": "https://www.pcgamer.com/the-internet-wont-stop-roasting-this-new-forspoken-trailer/",
    "snippet": "He covers all aspects of the industry, from new game announcements and npatch notes to legal disputes, Twitch beefs, esports, and Henry Cavill.",
    "date": "14 hours ago"
  },
  {
    "source": "ESPN",
    "title": "Fantasy football daily notes - Backfield rumors in New England, Miami",
    "link": "https://www.espn.com/fantasy/football/story/_/id/34383436/fantasy-football-daily-notes-backfield-rumors-new-england-miami",
    "snippet": "Right now, the second-year receiver is a nice value in fantasy football ndrafts, the WR37 in our draft trends, which isn't bad for a player...",
    "date": "1 hour ago"
  },
  {
    "source": "Cincinnati Bengals",
    "title": "Bengals Training Camp Notes: Dax Hill, Jackson Carman, Joe ...",
    "link": "https://www.bengals.com/news/jackson-carman-dax-hill-get-starts-in-preseason-opener",
    "snippet": "Jackson Carman gets Friday's start at left guard. Bengals head coach Zac nTaylor won't play most of his starters in Friday's (7:...",
    "date": "20 hours ago"
  },
  {
    "source": "Hoops Rumors",
    "title": "Texas Notes: Wood, Mavericks, Martin, T. Jones",
    "link": "https://www.hoopsrumors.com/2022/08/texas-notes-wood-mavericks-martin-t-jones.html",
    "snippet": "Texas Notes: Wood, Mavericks, Martin, T. Jones. August 6th 2022 at 10:59pm nCST by Arthur Hill. Christian Wood told WFAA TV that he's “counting my nblessings”...",
    "date": "4 days ago"
  },
  {
    "source": "Yahoo! Sports",
    "title": "Instant View: US CPI unchanged in July, raises hopes of Fed slowing",
    "link": "https://sports.yahoo.com/instant-view-us-cpi-unchanged-125340947.html",
    "snippet": "BONDS: The yield on 10-year Treasury notes was down 5.6 basis points to n2.741%; The two-year U.S. Treasury yield, was down 16.3 basis points...",
    "date": "1 day ago"
  },
  {
    "source": "NFL Trade Rumors",
    "title": "NFC Notes: Bears, Lions, Packers, Vikings - NFLTradeRumors ...",
    "link": "https://nfltraderumors.co/nfc-notes-bears-lions-packers-vikings-129/",
    "snippet": "Regarding Bears LB Roquan Smith requesting a trade, DE Robert Quinn nbelieves that Smith is deserving of a new contract: “You don't get a lot...",
    "date": "14 hours ago"
  },
  {
    "source": "ESPN",
    "title": "Fantasy football daily notes - Geno Smith, Albert Okwuegbunam trending up",
    "link": "https://www.espn.com/fantasy/football/story/_/id/34378640/fantasy-football-daily-notes-geno-smith-albert-okwuegbunam-trending-up",
    "snippet": "Read ESPN's fantasy football daily notes every weekday to stay caught ... nDon't overlook him in deep or tight end premium fantasy formats.",
    "date": "1 day ago"
  },
  {
    "source": "Hoops Rumors",
    "title": "Atlantic Notes: Quickley, Durant, Sixers, Raptors, R. Williams",
    "link": "https://www.hoopsrumors.com/2022/08/atlantic-notes-quickley-durant-sixers-raptors-r-williams.html",
    "snippet": "However, Morant said he and the former Kentucky standout aren't paying nattention to that trade speculation as they attempt to hone...",
    "date": "41 mins ago"
  }
]

If you don’t want to figure out how to use proxies, user-agent rotation, captcha solving, and so on, there are already-made APIs that handle this for you, for example the Google News Result API from SerpApi:

from serpapi import GoogleSearch
import json  # needed for json.dumps below
import os


def serpapi_code(query):
    params = {
        # https://docs.python.org/3/library/os.html#os.getenv
        "api_key": os.getenv("API_KEY"),    # your serpapi api key
        "engine": "google",                 # search engine
        "q": query,                         # search query
        "tbm": "nws",                       # google news
        "location": "Dallas",               # your location
        # other parameters
    }
    
    search = GoogleSearch(params)           # where data extraction happens on the SerpApi backend
    results = search.get_dict()             # JSON -> Python dict

    news_summaries = []
    
    for result in results["news_results"]:
        news_summaries.append({
            "source": result["source"],
            "title": result["title"],
            "link": result["link"],
            "snippet": result["snippet"],
            "date": result["date"]
        })

    return news_summaries
    
print(json.dumps(serpapi_code("T-Notes"), indent=2, ensure_ascii=False))

The output will be the same.

If you need a more detailed explanation about scraping Google News, have a look at the Web Scraping Google News with Python blog post, or you can watch the video.

Disclaimer: I work for SerpApi.

Answered By: Artur Chukhrai