Getting lazy loaded images while scraping

Question:

I am trying to scrape the images from this website, but instead of the real image src I only get the lazy-loading placeholder src of each image.

import urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time

url = "https://www.espncricinfo.com/series/indian-premier-league-2022-1298423/squads"
s = Service(r"M:\WebScraping\chromedriver.exe")

driver = webdriver.Chrome(service=s)
driver.maximize_window()
driver.get(url)
time.sleep(5)
driver.execute_script("window.scrollTo(0, 500);")

page = urllib.request.urlopen(url)
doc = BeautifulSoup(page, "html.parser")

teams = doc.find(class_="ds-p-0").find(class_="ds-mb-4")

for team in teams:
    print(team.img["src"])
    file_name = team.img["alt"]
    img_file = open(file_name + ".png", "wb")
    img_file.write(urllib.request.urlopen(team.img["src"]).read())
    img_file.close()

This is the output I am receiving (just the lazy-loading placeholder images):

https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg

But I want to get the actual src of each image instead, like this one:

https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/333800/333885.png
Asked By: Rayyan Alam


Answers:

BeautifulSoup only parses the HTML it is given, and urllib does not execute JavaScript, which is why, when you run

page = urllib.request.urlopen(url)
doc = BeautifulSoup(page, "html.parser")

you get the lazy-loading placeholder links: the real src values are filled in by JavaScript after the page has loaded. Selenium, on the other hand, drives a real browser that runs that JavaScript, so you can load the page with Selenium and then pass its page source to BeautifulSoup instead of the urllib response:

doc = BeautifulSoup(driver.page_source, "html.parser")

This way BeautifulSoup parses the fully rendered HTML of the page. The following code prints the URLs with both Selenium and BeautifulSoup, so that you can compare the two techniques.

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

chromedriver_path = '...'
# default Chrome options; the snippet uses `options`, so it has to be defined first
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=Service(chromedriver_path), options=options)

url = "https://www.espncricinfo.com/series/indian-premier-league-2022-1298423/squads"
driver.get(url)

# wait (up to 20 seconds) until the images are visible on page
images = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".ds-p-0 .ds-mb-4 img")))
# scroll to the last image, so that all images get rendered correctly
driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', images[-1])
time.sleep(2)

# PRINT URLS USING SELENIUM

print('Selenium')
for img in images:
    print(img.get_attribute('src'))

# PRINT URLS USING BEAUTIFULSOUP

doc = BeautifulSoup(driver.page_source, "html.parser")
teams = doc.find(class_="ds-p-0").find(class_="ds-mb-4")

print('BeautifulSoup')
for team in teams:
    print(team.img["src"])

Output

Selenium 
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313421.logo.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313422.logo.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/334700/334707.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313419.logo.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/333800/333885.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/344000/344062.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/317000/317003.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313423.logo.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313418.logo.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313480.logo.png

BeautifulSoup
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313421.logo.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313422.logo.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/334700/334707.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313419.logo.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/333800/333885.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/344000/344062.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/317000/317003.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313423.logo.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313418.logo.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313480.logo.png
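
Once the real URLs are available, the download step from the question could be rebuilt on top of them. The following is only a minimal sketch, not part of the code above: the helper names (is_placeholder, file_name_for, download_images) and the "lazyimage" marker are assumptions made for illustration, the marker being taken from the placeholder URL shown in the question. It also assumes the image host accepts plain urllib requests, as the question's code does.

```python
import os
import re
import urllib.request
from urllib.parse import urlparse

# Assumption: the placeholder can be recognised by this substring, which
# appears in the lazy-loading URL printed in the question.
PLACEHOLDER_MARKER = "lazyimage"


def is_placeholder(src):
    """Return True when src is missing or still points at the placeholder."""
    return not src or PLACEHOLDER_MARKER in src


def file_name_for(alt, src):
    """Build a file name from the alt text, keeping the URL's extension."""
    # Fall back to .png (as in the question) when the URL has no extension.
    extension = os.path.splitext(urlparse(src).path)[1] or ".png"
    # Replace characters that are not valid in file names on most systems.
    stem = re.sub(r"[^\w\- ]", "_", alt).strip() or "image"
    return stem + extension


def download_images(pairs, target_dir="."):
    """Download every (alt, src) pair whose src is a real image URL."""
    saved = []
    for alt, src in pairs:
        if is_placeholder(src):
            continue  # image was not rendered yet, nothing useful to save
        path = os.path.join(target_dir, file_name_for(alt, src))
        with urllib.request.urlopen(src) as response, open(path, "wb") as out:
            out.write(response.read())
        saved.append(path)
    return saved
```

With the Selenium loop above, the input could be collected as [(img.get_attribute("alt"), img.get_attribute("src")) for img in images] and passed to download_images. Skipping placeholder URLs means an image that was never scrolled into view is left out instead of being saved as the placeholder SVG.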
Answered By: sound wave