Web scraping articles from Google News

Question:

I am trying to scrape Google News with the gnews package. However, I don’t know how to scrape older articles, for example articles from 2010.

from gnews import GNews
from newspaper import Article
import pandas as pd
import datetime

google_news = GNews(language='es', country='Argentina', period = '7d')
argentina_news = google_news.get_news('protesta clarin')
print(len(argentina_news))

This code works perfectly for getting recent articles, but I need older ones. I saw https://github.com/ranahaani/GNews#todo, where something like the following appears:

google_news = GNews(language='es', country='Argentina', period='7d', start_date='01-01-2015', end_date='01-01-2016', max_results=10, exclude_websites=['yahoo.com', 'cnn.com'],
                    proxy=proxy)

but when I try start_date I get:

TypeError: __init__() got an unexpected keyword argument 'start_date'

Can anyone help me get articles for specific dates? Thank you very much, guys!

Answers:

The example code will not work with gnews==0.2.7, which is the latest version you can install from PyPI via pip. The documentation describes the unreleased mainline code, which you can only get directly from the project’s git repository.

This is confirmed by inspecting the GNews.__init__ method, which has no keyword arguments for start_date or end_date:

In [1]: import gnews

In [2]: gnews.GNews.__init__??
Signature:
gnews.GNews.__init__(
    self,
    language='en',
    country='US',
    max_results=100,
    period=None,
    exclude_websites=None,
    proxy=None,
)
Docstring: Initialize self.  See help(type(self)) for accurate signature.
Source:
    def __init__(self, language="en", country="US", max_results=100, period=None, exclude_websites=None, proxy=None):
        self.countries = tuple(AVAILABLE_COUNTRIES),
        self.languages = tuple(AVAILABLE_LANGUAGES),
        self._max_results = max_results
        self._language = language
        self._country = country
        self._period = period
        self._exclude_websites = exclude_websites if exclude_websites and isinstance(exclude_websites, list) else []
        self._proxy = {'http': proxy, 'https': proxy} if proxy else None
File:      ~/src/news-test/.venv/lib/python3.10/site-packages/gnews/gnews.py
Type:      function
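
You can check which version you actually have installed, e.g.:

from importlib.metadata import version

print(version('gnews'))  # e.g. '0.2.7' if you have the current PyPI release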

If you want the start_date and end_date functionality, it was only added recently, so you will need to install the module from the project’s git source.

# use whatever you use to uninstall any pre-existing gnews module
pip uninstall gnews

# install from the project's git main branch
pip install git+https://github.com/ranahaani/GNews.git

Now you can use the start/end functionality:

import datetime

from gnews import GNews

start = datetime.date(2015, 1, 15)
end = datetime.date(2015, 1, 16)

google_news = GNews(language='es', country='Argentina', start_date=start, end_date=end)
rsp = google_news.get_news("protesta")
print(rsp)

I get this as a result:

[{'title': 'Latin Roots: The Protest Music Of South America - NPR',
  'description': 'Latin Roots: The Protest Music Of South America  NPR',
  'published date': 'Thu, 15 Jan 2015 08:00:00 GMT',
  'url': 'https://www.npr.org/sections/world-cafe/2015/01/15/377491862/latin-roots-the-protest-music-of-south-america',
  'publisher': {'href': 'https://www.npr.org', 'title': 'NPR'}}]

Also note:

  • period is ignored if you set start_date and end_date
  • Their documentation shows you can pass the dates as tuples like (2015, 1, 15). This doesn’t seem to work, so to be safe pass datetime objects, as in the sketch below.
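
Putting that together, here is a minimal sketch for collecting older articles (for example the 2010 ones the question asks about) by walking month-sized windows with start_date and end_date. It assumes the git install described above and reuses the language/country values from the question:

import datetime

from gnews import GNews


def fetch_month(query, year, month):
    # One-month window, passed as datetime.date objects (tuples seem unreliable, see above)
    start = datetime.date(year, month, 1)
    if month == 12:
        end = datetime.date(year + 1, 1, 1)
    else:
        end = datetime.date(year, month + 1, 1)
    google_news = GNews(language='es', country='Argentina', start_date=start, end_date=end, max_results=100)
    return google_news.get_news(query)


articles_2010 = []
for month in range(1, 13):
    articles_2010.extend(fetch_month('protesta clarin', 2010, month))

print(len(articles_2010))
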
Answered By: wkl

You can also use the Python requests module and XPath to get what you need without using the gnews package.
Here is a snippet of the code:

import requests
from lxml.html import fromstring

url = 'https://www.google.com/search?q=google+news&&hl=es&sxsrf=ALiCzsZoYzwIP0ZR9d6LLa5U6IJ2WDo1sw%3A1660116293247&source=lnt&tbs=cdr%3A1%2Ccd_min%3A8%2F10%2F2010%2Ccd_max%3A8%2F10%2F2022&tbm=nws'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4758.87 Safari/537.36",
    }

r = requests.get(url,  headers=headers, timeout=30)
root = fromstring(r.text)

news = []
for i in root.xpath('//div[@class="xuvV6b BGxR7d"]'):
    item = {}
    item['title'] = i.xpath('.//div[@class="mCBkyc y355M ynAwRc MBeuO nDgy9d"]//text()')
    item['description'] = i.xpath('.//div[@class="GI74Re nDgy9d"]//text()')
    item['published date'] = i.xpath('.//div[@class="OSrXXb ZE0LJd"]//span/text()')
    item['url'] = i.xpath('.//a/@href')
    item['publisher'] = i.xpath('.//div[@class="CEMjEf NUnG9d"]//span/text()')
    news.append(item)
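
The date restriction lives in the tbs parameter of that hard-coded url (cdr:1 plus cd_min and cd_max in M/D/YYYY format). If you need other windows, a small helper along these lines can build the URL; the parameter layout is inferred from the URL above rather than from any official API:

from urllib.parse import urlencode

def google_news_search_url(query, date_min, date_max, hl='es'):
    # date_min / date_max as 'M/D/YYYY' strings, mirroring cd_min / cd_max above
    params = {
        'q': query,
        'hl': hl,
        'tbs': f'cdr:1,cd_min:{date_min},cd_max:{date_max}',
        'tbm': 'nws',  # restrict results to the news vertical
    }
    return 'https://www.google.com/search?' + urlencode(params)

print(google_news_search_url('google news', '8/10/2010', '8/10/2022'))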

And here is what I get from the scraping loop:

for i in news:
    print(i)

"""
{'published date': ['Hace 1 mes'], 'url': ['https://www.20minutos.es/noticia/5019464/0/google-news-regresa-a-espana-tras-ocho-anos-cerrado/'], 'publisher': ['20Minutos'], 'description': ['"Google News ayuda a los lectores a encontrar noticias de fuentes fidedignas, desde los sitios web de noticias más grandes del mundo hasta las publicaciones...'], 'title': ['Noticias de 20minutos en Google News: cómo seguir la última ...']}
{'published date': ['14 jun 2022'], 'url': ['https://www.bbc.com/mundo/noticias-61803565'], 'publisher': ['BBC'], 'description': ['Cómo funciona LaMDA, el sistema de inteligencia artificial que "cobró conciencia y siente" según un ingeniero de Google. Alicia Hernández @por_puesto; BBC News...'], 'title': ['Cómo funciona LaMDA, el sistema de inteligencia artificial que "cobró conciencia y siente" según un ingeniero de Google']}
{'published date': ['24 mar 2022'], 'url': ['https://www.theguardian.com/world/2022/mar/24/russia-blocks-google-news-after-it-bans-ads-on-proukraine-invasion-content'], 'publisher': ['The Guardian'], 'description': ['Russia has blocked Google News, accusing it of promoting "inauthentic information" about the invasion of Ukraine. The ban came just hours after Google...'], 'title': ['Russia blocks Google News after ad ban on content condoning Ukraine invasion']}
{'published date': ['2 feb 2021'], 'url': ['https://dircomfidencial.com/medios/google-news-showcase-que-es-y-como-funciona-el-agregador-por-el-que-los-medios-pueden-generar-ingresos-20210202-0401/'], 'publisher': ['Dircomfidencial'], 'description': ['Google News Showcase: qué es y cómo funciona el agregador por el que los medios pueden generar ingresos. MEDIOS | 2 FEBRERO 2021 | ACTUALIZADO: 3 FEBRERO 2021 8...'], 'title': ['Google News Showcase: qué es y cómo funciona el ...']}
{'published date': ['4 nov 2021'], 'url': ['https://www.euronews.com/next/2021/11/04/google-news-returns-to-spain-after-the-country-adopts-new-eu-copyright-law'], 'publisher': ['Euronews'], 'description': ['News aggregator Google News will return to Spain following a change in copyright law that allows online platforms to negotiate fees directly with content...'], 'title': ['Google News returns to Spain after the country adopts new EU copyright law']}
{'published date': ['27 may 2022'], 'url': ['https://indianexpress.com/article/technology/tech-news-technology/google-hit-with-fresh-uk-investigation-over-ad-tech-dominance-7938896/'], 'publisher': ['The Indian Express'], 'description': ['The Indian Express website has been rated GREEN for its credibility and trustworthiness by Newsguard, a global service that rates news sources for their...'], 'title': ['Google hit with fresh UK investigation over ad tech dominance']}
{'published date': ['Hace 1 día'], 'url': ['https://indianexpress.com/article/technology/tech-news-technology/google-down-outage-issues-user-error-8079170/'], 'publisher': ['The Indian Express'], 'description': ['The outage also impacted a range of other Google products such as Google ... Join our Telegram channel (The Indian Express) for the latest news and updates.'], 'title': ['Google, Google Maps and other services recover after global ...']}
{'published date': ['14 nov 2016'], 'url': ['https://www.reuters.com/article/us-alphabet-advertising-idUSKBN1392MM'], 'publisher': ['Reuters'], 'description': ["Google's move similarly does not address the issue of fake news or hoaxes appearing in Google search results. That happened in the last few days, when a search..."], 'title': ['Google, Facebook move to restrict ads on fake news sites']}
{'published date': ['27 sept 2021'], 'url': ['https://news.sky.com/story/googles-appeal-against-eu-record-3-8bn-fine-starts-today-as-us-cases-threaten-to-break-the-company-up-12413655'], 'publisher': ['Sky News'], 'description': ["Google's five-day appeal against the decision is being heard at European ... told Sky News he expected there could be another appeal after the hearing in..."], 'title': ["Google's appeal against EU record £3.8bn fine starts today, as US cases threaten to break the company up"]}
{'published date': ['11 jun 2022'], 'url': ['https://www.washingtonpost.com/technology/2022/06/11/google-ai-lamda-blake-lemoine/'], 'publisher': ['The Washington Post'], 'description': ["SAN FRANCISCO — Google engineer Blake Lemoine opened his laptop to the interface for LaMDA, Google's artificially intelligent chatbot generator,..."], 'title': ["The Google engineer who thinks the company's AI has come ..."]}
"""
Answered By: Billy Jhon