Web Scraping a Text Using Python Gives Empty Output

Question

I’m trying to get the affiliation text from this page: https://www.sciencedirect.com/science/article/abs/pii/S001191642300142X

These are the elements I am targeting:

<dl class="affiliation"><dt><sup>a</sup></dt><dd>Department of Engineering, Università Campus Bio-Medico di Roma, Via Alvaro del Portillo, 21, 00128 Rome, Italy</dd></dl>

<dl class="affiliation"><dt><sup>b</sup></dt><dd>Department of Chemical Sciences, University of Naples Federico II, Complesso Universitario di Monte Sant'Angelo, 80126 Napoli, Italy</dd></dl>

For some reason I cannot get the affiliation text, although I can get the author and the title with the same approach.

BTW, this website requires a proxy for scraping, so this code will not run as-is, but it shows what I did.

I tried:

from bs4 import BeautifulSoup
import requests

response = requests.get('https://www.sciencedirect.com/science/article/abs/pii/S001191642300142X')
print('Response Body: ', response)
soup = BeautifulSoup(response.content.decode('utf-8'), "html.parser")

for aff in soup.find_all('dl', class_='affiliation'):
    affiliation = aff.get_text()
    print(affiliation)

Expected output:

Department of Engineering, Università Campus Bio-Medico di Roma, Via Alvaro del Portillo, 21, 00128 Rome, Italy
Department of Chemical Sciences, University of Naples Federico II, Complesso Universitario di Monte Sant'Angelo, 80126 Napoli, Italy

Edit:
When visiting the website, click on ‘Show More’ to see the affiliation text.

Asked By: user17356493

Answer 1

1. The desired data only appears after clicking the Show more button and is rendered by JavaScript. requests fetches just the initial HTML, so Beautiful Soup alone can't see it.

2. The website is under Cloudflare protection:

https://www.sciencedirect.com/science/article/abs/pii/S001191642300142X is using Cloudflare DNS!

https://www.sciencedirect.com/science/article/abs/pii/S001191642300142X is using Cloudflare CDN/Proxy!

https://www.sciencedirect.com/science/article/abs/pii/S001191642300142X is using Cloudflare SSL!

3. One of the best solutions is to use Selenium together with bs4: Selenium clicks the Show more button and gets past the Cloudflare check, and bs4 parses the rendered page.
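Point 1 can be checked offline. The sketch below runs the same kind of lookup against two HTML strings: the affiliation markup quoted in the question, and a made-up stand-in for the initial response in which the affiliations have not been rendered yet. INITIAL_HTML and the helper name extract_affiliations are assumptions for illustration, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Markup copied from the question: what the DOM contains AFTER "Show more" is clicked.
EXPANDED_HTML = """
<dl class="affiliation"><dt><sup>a</sup></dt><dd>Department of Engineering, Università Campus Bio-Medico di Roma, Via Alvaro del Portillo, 21, 00128 Rome, Italy</dd></dl>
<dl class="affiliation"><dt><sup>b</sup></dt><dd>Department of Chemical Sciences, University of Naples Federico II, Complesso Universitario di Monte Sant'Angelo, 80126 Napoli, Italy</dd></dl>
"""

# Hypothetical stand-in for the initial response: no affiliation elements yet.
INITIAL_HTML = '<div class="AuthorGroups"><span class="button-link-text">Show more</span></div>'


def extract_affiliations(html):
    """Return the text of every <dd> inside a <dl class="affiliation">."""
    soup = BeautifulSoup(html, "html.parser")
    return [dd.get_text(strip=True) for dd in soup.select("dl.affiliation dd")]


print(extract_affiliations(EXPANDED_HTML))  # two affiliation strings
print(extract_affiliations(INITIAL_HTML))   # empty list: nothing to find
```

The selector itself is fine. The empty output comes from the elements being absent from the HTML that requests receives.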

Working code:

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# options.add_experimental_option("detach", True)  # keep the browser open after the script ends
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

url = 'https://www.sciencedirect.com/science/article/abs/pii/S001191642300142X'
driver.get(url)
time.sleep(3)  # give the page time to load

# Click "Show more" so the affiliations are added to the DOM
driver.find_element(By.XPATH, '//span[@class="button-link-text" and contains(text(), "Show more")]').click()
time.sleep(2)  # give the JavaScript time to render the affiliations

# Parse the rendered page source, not the raw HTTP response
soup = BeautifulSoup(driver.page_source, "html.parser")

# Each affiliation is a <dd> inside the author-group block
txt = [x.get_text().strip() for x in soup.select('[class="AuthorGroups text-s"] dl dd')]
print(txt)

driver.quit()

Output:

['Department of Engineering, Università Campus Bio-Medico di Roma, Via Alvaro del Portillo, 21, 00128 Rome, Italy', "Department of Chemical Sciences, University of Naples Federico II, Complesso Universitario di Monte Sant'Angelo, 80126 Napoli, Italy", 'Department of Chemical Engineering Materials & Environment, Sapienza University of Rome, Via Eudossiana, 18, 00184 Rome, Italy', 'Department of Chemical Engineering, School of Engineering, The University of Manchester, Oxford Road, Manchester, M13 9PL, United Kingdom']

Answered By: Md. Fazlul Hoque
