News article is not being scraped from the h2 class

Question:

I’m working on a project where I’ve been assigned to scrape all news articles from a website: ‘https://asia.nikkei.com/Spotlight/Podcast’. The page has two main headline classes to scrape: h2 (the giant card that says: Asia Stream: Shinzo Abe’s Assassination and Legacy) and h4. With my code, I’ve been able to scrape all the news articles from the h4 class, but for the h2 class there is a problem: it scrapes only the article’s title.

My Code

from bs4 import BeautifulSoup as soup
import requests

# Fetch the listing page and print every h2 headline
r = requests.get('https://asia.nikkei.com/Spotlight/Podcast')
b = soup(r.content, 'lxml')
for news in b.find_all('h2'):
    print(news.text)

# Collect the relative links from the h2 headline cards
finalisedh2_links = []
for news in b.find_all('h2', {'class': 'card-article__headline'}):
    finalisedh2_links.append(news.a['href'])

# Turn the relative links into absolute URLs
q = 'https://asia.nikkei.com'
output = ["{}{}".format(q, i) for i in finalisedh2_links]
output

# Visit each article and collect the text of its body
linked_news = []
for link in output:
    page = requests.get(link)
    bsobj = soup(page.content, 'lxml')
    for news in bsobj.find_all('div', {'class': "ezrichtext-field"}):
        linked_news.append(news.text.strip())

linked_news

When I checked linked_news, it showed:

["NEW YORK -- Welcome to Nikkei Asia's podcast: Asia Stream."]

It should scrape the whole news article. I don’t know exactly what the problem is, as the same code scrapes all the other news articles in the h4 class.

Please help me with this.

Asked By: Starlord22

||

Answers:

The data on these pages is dynamically generated, so a plain HTTP request does not return the full article. You can scrape all the articles using Selenium.
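One way to see what "dynamically generated" means in practice is to look only at the raw HTML, before any JavaScript runs. The sketch below uses only the standard library and a made-up HTML string in place of a real response; the helper names `ClassTextExtractor` and `static_text` are hypothetical and not part of any library. If a page behaves like the sample, only the text already present in the markup is visible to a parser.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element carrying a given CSS class."""

    # Void elements have no closing tag, so they must not change the depth.
    VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # greater than 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.css_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def static_text(raw_html, css_class):
    """Return the whitespace-normalised text found in the raw markup."""
    parser = ClassTextExtractor(css_class)
    parser.feed(raw_html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Made-up HTML standing in for a server response (an assumption, not the
# real page): only the first paragraph is in the markup, and the rest
# would be inserted later by a script.
sample = """
<div class="ezrichtext-field"><p>NEW YORK -- Welcome to the podcast.</p></div>
<script>/* remaining paragraphs are fetched and inserted by JavaScript */</script>
"""

print(static_text(sample, "ezrichtext-field"))
```

Anything a script adds after the page loads never appears in this output, which is why a real browser driven by Selenium is used below.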

After analyzing the DOM of the page, I found that an article has three parts: the article header, the article body, and the article bottom (footer).

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import pandas as pd

# Replace this with the path to your own chromedriver executable
chrome_path = r"C:\Users\hpoddar\Desktop\Tools\chromedriver_win32\chromedriver.exe"

s = Service(chrome_path)
options = webdriver.ChromeOptions()
options.add_argument("--disable-site-isolation-trials")
driver = webdriver.Chrome(service=s, options=options)
driver.get('https://asia.nikkei.com/Spotlight/Podcast')
driver.maximize_window()

# Keep clicking "load more" until the button no longer appears within 10 seconds
while True:
    try:
        WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, 'load-more')))
        driver.execute_script('''document.getElementsByClassName('load-more')[0].click()''')
    except TimeoutException:
        print("All articles have been loaded")
        break

# Collect the link behind every article card headline
articles_link = []
all_cards = driver.find_elements(By.CSS_SELECTOR, 'article.card-article .card-article__headline')
for card in all_cards:
    articles_link.append(card.find_element(By.TAG_NAME, 'a').get_attribute('href'))

# Visit every article and store its URL, title, details and body text
df = pd.DataFrame(columns=['url', 'title', 'article_details', 'article_content'])
for link in articles_link:
    driver.get(link)
    # Refresh until the '.podcast' element is present on the page
    while True:
        try:
            driver.find_element(By.CSS_SELECTOR, '.podcast')
            break
        except Exception:  # narrower than a bare except, so Ctrl+C still stops the script
            driver.refresh()
    title = driver.find_element(By.XPATH, '//h1[@class="article-header__title"]').text
    article_details = driver.find_element(By.XPATH, '//div[@class="article__details"]').text
    article_content = driver.find_element(By.XPATH, '//div[@data-article-body]').text
    df.loc[len(df)] = [link, title, article_details, article_content]

This gives us the expected output in a DataFrame:

                                                  url  ...                                    article_content
0   https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast: ...
1   https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast: ...
2   https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast: ...
3   https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast: ...
4   https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast: ...
5   https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast: ...
6   https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast: ...
7   https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast: ...
8   https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast: ...
9   https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast: ...
10  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast: ...
11  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast: ...
12  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast: ...
13  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast: ...
14  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast: ...
15  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast: ...
16  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast: ...
17  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast: ...
18  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast: ...
19  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's new podca...
20  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast, ...
21  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's new podca...
22  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Nikkei Asia is launching a new pod...
23  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast, ...
24  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's new podca...
25  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Nikkei Asia is launching a new pod...
26  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast, ...
27  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's new podca...
28  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Nikkei Asia is launching a new pod...
29  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast, ...
30  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's new podca...
31  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Nikkei Asia is launching a new pod...
32  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's podcast, ...
33  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Welcome to Nikkei Asia's new podca...
34  https://asia.nikkei.com/Spotlight/Podcast/Asia...  ...  NEW YORK -- Nikkei Asia is launching a new pod...
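
The refresh loop in the code above retries forever if the `.podcast` element never appears. A bounded variant might look like the sketch below. The helper `retry_until` and its `max_attempts` parameter are hypothetical names introduced here for illustration, and the fake functions only simulate an element that shows up on the third try.

```python
def retry_until(check, recover, max_attempts=3):
    """Call check() up to max_attempts times, calling recover() between failures.

    Returns True as soon as check() completes without raising,
    or False if every attempt raised an exception.
    """
    for attempt in range(max_attempts):
        try:
            check()
            return True
        except Exception:
            # Do not bother recovering after the final failed attempt.
            if attempt < max_attempts - 1:
                recover()
    return False


# Stand-ins for the browser calls (assumptions for this demo only).
calls = {"check": 0, "recover": 0}


def fake_check():
    calls["check"] += 1
    if calls["check"] < 3:
        raise LookupError("element not found yet")


def fake_recover():
    calls["recover"] += 1


ok = retry_until(fake_check, fake_recover, max_attempts=5)
print(ok, calls)  # True {'check': 3, 'recover': 2}
```

With Selenium, the same helper could be called as `retry_until(lambda: driver.find_element(By.CSS_SELECTOR, '.podcast'), driver.refresh)`, skipping the article when it returns False.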
Answered By: Himanshu Poddar