Can't scrape <h3> tag from page

Question

It seems I can scrape any tag and class on this page except h3: the search keeps returning None or an empty list. I’m trying to get this h3 tag:

[Screenshot: page source with the target h3 tag highlighted]

…on the following webpage:

https://www.empireonline.com/movies/features/best-movies-2/

And this is the code I use:

from bs4 import BeautifulSoup
import requests
from pprint import pprint

URL = "https://www.empireonline.com/movies/features/best-movies-2/"

response = requests.get(URL)
web_html = response.text

soup = BeautifulSoup(web_html, "html.parser")

movies = soup.find_all(name="h3", class_="jsx-4245974604")

movies_text=[]

for item in movies:
    result = item.getText()
    movies_text.append(result)

print(movies_text)

Can you please help me solve this problem?

Asked By: Denis Culic


Answer 1

As others have mentioned, this is dynamic content: the h3 elements are generated by JavaScript when the page runs in a browser. requests only downloads the initial HTML, so BS4 cannot find the class "jsx-4245974604" in it.

If you print out your "soup" variable, you can see that the class is not there. But if you simply want the names of the movies, you can use another part of the HTML in this case.

The movie name is in the alt attribute of each picture's img tag (and also in many other parts of the HTML).

import requests
from bs4 import BeautifulSoup

URL = "https://www.empireonline.com/movies/features/best-movies-2/"

# requests returns only the initial HTML, before any JavaScript runs
response = requests.get(URL)
web_html = response.text

soup = BeautifulSoup(web_html, "html.parser")

# The img tags are present in the initial HTML; each alt attribute holds a movie name
movies = soup.find_all("img", class_="jsx-952983560")

movies_text = []

for item in movies:
    result = item.get("alt")
    movies_text.append(result)

print(movies_text)

If you run into this issue in the future, print out the initial HTML you get from soup and check by eye whether the information you need is there.
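The same check can also be done in code rather than by eye. Below is a minimal sketch that uses only the standard library's html.parser, so it runs without a network connection or extra installs. The names TagFinder, find_in_static_html, and SAMPLE_HTML are hypothetical, and the sample markup with its placeholder titles is an assumed stand-in for response.text, not the real page source.

```python
from html.parser import HTMLParser


class TagFinder(HTMLParser):
    """Collects the attributes of every start tag matching a tag name and, optionally, a class."""

    def __init__(self, tag, class_name=None):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.matches = []  # one attribute dict per matching element

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # A class attribute may hold several space-separated names, or be missing
        classes = (attr_map.get("class") or "").split()
        if tag == self.tag and (self.class_name is None or self.class_name in classes):
            self.matches.append(attr_map)


def find_in_static_html(html, tag, class_name=None):
    """Return the attribute dicts of all matching elements in the given HTML string."""
    finder = TagFinder(tag, class_name)
    finder.feed(html)
    return finder.matches


# Assumed stand-in for response.text: the img tags exist in the initial HTML,
# while the h3 tags would only be created later by JavaScript.
SAMPLE_HTML = """
<div id="root">
  <img class="jsx-952983560" alt="100) Example Movie A" src="a.jpg"/>
  <img class="jsx-952983560" alt="99) Example Movie B" src="b.jpg"/>
  <script src="app.js"></script>
</div>
"""

# No h3 with that class in the static HTML, so the result is an empty list
print(find_in_static_html(SAMPLE_HTML, "h3", "jsx-4245974604"))

# The img tags are there, so their alt attributes can be read directly
print([m["alt"] for m in find_in_static_html(SAMPLE_HTML, "img", "jsx-952983560")])
```

If the first call returns an empty list for the HTML you downloaded, the element is most likely created at runtime, and you need either another part of the static HTML or a tool that runs the page's JavaScript.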

Answered By: David

Answer 2

To scrape data from JSX you actually need a browser-driving scraper like Selenium WebDriver.

First, download and install the WebDriver for your browser.
Below is the solution for the Chrome browser:

https://chromedriver.chromium.org/downloads

And here is the code, which I have just tested:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.empireonline.com/movies/features/best-movies-2/")
sel_element = driver.find_elements(By.TAG_NAME, "h3")
new_list = []
for element in sel_element:
    text = element.text
    new_list.append(text)

driver.quit()  # close the browser once the text has been collected

# Open the file once, then write in reverse order (the 1st element on the page is number 100)
with open("100_movies.txt", mode="a", encoding="utf-8") as file:
    for item in new_list[::-1]:
        file.write(f"{item}\n")

Answered By: Teslajke
