Python library newspaper is not returning the published date

Question:

I am using the newspaper Python library to extract some data from news stories. The problem is that I am not getting the published date for some URLs. These URLs work fine; they all return 200. I am doing this for a very large dataset, and this is one of the URLs for which the date extraction did not work. The code works for some links and not for others from the same domain, so I know that the problem isn’t something like my IP being blocked for too many requests. I tried it on just this one URL and got the same result (no date).

import os
import sys
from newspaper import Article   

def split(link):
    try:
        story = Article(link)
        story.download()
        story.parse()
        date_time = str(story.publish_date)
        split_date = date_time.split()
        date = split_date[0]
        if date != "None":
            print(date)
    except:
        print("This URL did not return a published date. Try a different URL.")
        print(link)

if __name__ == "__main__":
    link = "https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one"
    split(link)

I am getting this output:

This URL did not return a published date. Try a different URL.
https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one

Asked By: Sam Hall

||

Answers:

Please check the link. I checked it and it is currently unavailable.
If the link is unavailable, the code will not work.

Try adding some error handling to your code to catch URLs that return a 404, such as this one: https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one

from newspaper import Config
from newspaper import Article
from newspaper.article import ArticleException

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

base_url = 'https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one'
try:
    article = Article(base_url, config=config)
    article.download()
    article.parse()
except ArticleException as error:
    print(error)

Output:

Article `download()` failed with 404 Client Error: Not Found for url: https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one on URL https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one
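
The 404 also appears to explain why the script in the question prints its fallback message: the download fails, `parse()` then raises `ArticleException`, and the bare `except` reports it as a missing date. Separately, when a page downloads fine but no date is found, the question's code prints nothing at all. Here is a minimal sketch of a helper that makes the "no date" case explicit, assuming `publish_date` is a `datetime` when extraction succeeds and `None` (or another falsy value) otherwise. The name `format_publish_date` is hypothetical and not part of Newspaper3k.

```python
from datetime import datetime
from typing import Optional


def format_publish_date(publish_date: Optional[datetime]) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None when no date was extracted."""
    # Assumption: publish_date is a datetime when extraction succeeds
    # and a falsy value (None or an empty string) otherwise.
    if not publish_date:
        return None
    return publish_date.date().isoformat()


print(format_publish_date(datetime(2022, 3, 31, 6, 8, 59)))  # 2022-03-31
print(format_publish_date(None))  # None
```

Calling a helper like this inside the `try` block, while handling `ArticleException` separately as in the example above, makes it possible to tell a failed download apart from a page without a detectable date.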

Newspaper3k has multiple ways to extract the publish dates from articles. Take a look at this document that I wrote on how to use Newspaper3k.

Here is an example for this valid URL https://www.aljazeera.com/program/featured-documentaries/2022/3/31/lords-of-water that extracts data elements from the page’s meta tags.

from newspaper import Config
from newspaper import Article
from newspaper.article import ArticleException

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

base_url = 'https://www.aljazeera.com/program/featured-documentaries/2022/3/31/lords-of-water'
try:
    article = Article(base_url, config=config)
    article.download()
    article.parse()
    article_meta_data = article.meta_data

    article_title = [value for (key, value) in article_meta_data.items() if key == 'pageTitle']
    print(article_title)

    article_published_date = str([value for (key, value) in article_meta_data.items() if key == 'publishedDate'])
    print(article_published_date)

    article_description = [value for (key, value) in article_meta_data.items() if key == 'description']
    print(article_description)

except ArticleException as error:
    print(error)

Output:

['Lords of Water']
['2022-03-31T06:08:59']
['Is water the new oil? We expose the financialisation of water.']
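
Because the comprehensions above return lists, each value is printed wrapped in brackets. Here is a minimal sketch of reading the same key as a plain value and converting it to a date, assuming `meta_data` behaves like a dictionary and that `publishedDate` holds an ISO 8601 string as in the output above. The dictionary literal stands in for `article.meta_data`, and `published_date_from_meta` is a hypothetical helper.

```python
from datetime import date, datetime
from typing import Mapping, Optional


def published_date_from_meta(meta_data: Mapping[str, object]) -> Optional[date]:
    """Return the date stored under 'publishedDate', or None if absent or unparseable."""
    raw = meta_data.get("publishedDate")
    if not isinstance(raw, str):
        return None
    try:
        # Parses strings such as '2022-03-31T06:08:59'.
        return datetime.fromisoformat(raw).date()
    except ValueError:
        return None


# Stand-in for article.meta_data, using the values shown in the output above.
sample_meta = {"pageTitle": "Lords of Water", "publishedDate": "2022-03-31T06:08:59"}
print(published_date_from_meta(sample_meta))  # 2022-03-31
```

Returning `None` instead of raising keeps a large batch run going when a page lacks the tag or uses a different date format.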

Answered By: Life is complex