Data scraping of an aria-label with beautifulsoup

Question:

From the following, i am trying to extract the analysts price targets.
I am interested in the information present inside the aria-label.

I tried multiple versions of BeautifulSoup I found online with the following setup:

import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'XXXXX'}  # XXXXX replaced with an actual user agent
url = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
  1. The aria-label seems to sit on a div with a class, so I tried the following:

      target = soup.find('div', {'class':'Px(10px)'})
    

Result = None

  2. It is inside a section, so I tried the following:

      target = soup.find('section', attrs={'data-test':'price-targets'})
    

Result = None

  3. Then I tried to go one level higher, using the ID:

      target = soup.find('div', {'id':'mrt-node-Col2-5-QuoteModule'}).find_all('div')[0]
    

Result = <div data-react-checksum="2049647463" data-reactid="1" data-reactroot="" id="Col2-5-QuoteModule-Proxy"><span data-reactid="2"></span></div>

Thus, option 3 gets me closer, but I receive an error when I modify the find_all index.

Is there any solution or workaround to extract the four values present in the aria-label?

The numbers next to 'Low', 'Current', 'Average' & 'High' are my target.


Asked By: ArthurL


Answers:

As @Ann Zen mentioned in the comments, the website renders its elements and data dynamically, so BeautifulSoup alone can't handle it. With Selenium you can wait until the app has loaded and then try to get the element.

Example: web-scraping-with-selenium

Answered By: Utpal Dutt

As Selenium can be slow, I found a second possible solution to my issue: get the source code of the page with requests and search for the data with a combination of json and regex.
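A minimal sketch of that idea, run here against an inline HTML sample: Yahoo pages of that era embedded their state in a large `root.App.main = {...};` JSON blob inside a script tag, so the exact key path (`QuoteSummaryStore` → `financialData`) is an assumption based on that layout and may change at any time.

```python
import json
import re

# Inline sample standing in for r.text from requests; the real page embeds
# its state in a much larger JSON blob inside a <script> tag.
html = """
<script>
root.App.main = {"context": {"dispatcher": {"stores": {"QuoteSummaryStore":
{"financialData": {"targetLowPrice": {"raw": 122},
                   "targetMeanPrice": {"raw": 174.62},
                   "targetHighPrice": {"raw": 214},
                   "currentPrice": {"raw": 129.62}}}}}}};
</script>
"""

# Capture everything between "root.App.main = " and the closing ";"
match = re.search(r'root\.App\.main\s*=\s*(\{.*?\})\s*;', html, re.DOTALL)
data = json.loads(match.group(1))

# Assumed key path into the embedded store
financial = data["context"]["dispatcher"]["stores"]["QuoteSummaryStore"]["financialData"]
targets = {key: financial[key]["raw"]
           for key in ("targetLowPrice", "currentPrice",
                       "targetMeanPrice", "targetHighPrice")}
print(targets)
```

With the real page you would replace `html` with `r.text` from the requests call in the question; if Yahoo changes the embedded structure, the regex and key path both need updating.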

Answered By: ArthurL

What about the yfinance Python package? To get the analyst price targets you need to scroll the page to the end and wait until the data is loaded.

In this case, the selenium library is used, which allows you to simulate user actions in the browser.

Install libraries:

pip install bs4 lxml selenium webdriver_manager

Import libraries:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time  # lxml only needs to be installed, not imported

For Selenium to work, you need ChromeDriver, which can be downloaded manually or from code. In this case, the second method is used. ChromeDriver is started and stopped through a Service, and webdriver_manager downloads the driver binary under the hood:

service = Service(ChromeDriverManager().install())

You should also add options to work correctly:

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--lang=en')
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36')
  • --headless – to run Chrome in headless mode.
  • --lang=en – to set the browser language to English.
  • user-agent – to act as a "real" user by sending the header with the request. You can look up what your own user-agent is in your browser.

Now we can start webdriver and pass the URL to the get() method:

URL = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'

driver = webdriver.Chrome(service=service, options=options)
driver.get(URL)

The page scrolling algorithm looks like this:

  1. Find out the initial page height and write the result to the old_height variable.
  2. Scroll the page using the script and wait 2 seconds for the data to load.
  3. Find out the new page height and write the result to the new_height variable.
  4. If the variables new_height and old_height are equal, then we complete the algorithm, otherwise we write the value of the variable new_height to the variable old_height and return to step 2.

Getting the page height and scrolling are done by passing JavaScript code to the execute_script() method:

old_height = driver.execute_script("""
    function getHeight() {
        return document.querySelector('#Aside').scrollHeight;
    }
    return getHeight();
""")

while True:
    driver.execute_script('window.scrollTo(0, document.querySelector("#Aside").scrollHeight);')

    time.sleep(2)

    new_height = driver.execute_script("""
        function getHeight() {
            return document.querySelector('#Aside').scrollHeight;
        }
        return getHeight();
    """)

    if new_height == old_height:
        break

    old_height = new_height

We create the soup object and stop the driver:

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

Selecting an element by class is not always a good idea, because classes can change. It is more reliable to target attributes. Here, I select the element whose data-test attribute has the value price-targets, then the div inside it, and read the value of its aria-label attribute:

price_targets = soup.select_one('[data-test="price-targets"] div').get('aria-label')
print(price_targets)
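The aria-label comes back as a single string (see the output at the end of this answer). If you need the four numbers separately, a small regex can split it into a dict; the "Label  number" format assumed here is taken from that observed output and may change on Yahoo's side.

```python
import re

# The aria-label string as printed by the scraper above (format assumed
# from the observed output).
price_targets = 'Low  122 Current  129.62 Average  174.62 High  214'

# Capture "Label  number" pairs, e.g. ('Low', '122')
pairs = re.findall(r'(Low|Current|Average|High)\s+([\d.]+)', price_targets)
parsed = {label: float(value) for label, value in pairs}
print(parsed)  # {'Low': 122.0, 'Current': 129.62, 'Average': 174.62, 'High': 214.0}
```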

If you want to extract other data, you can see the Scrape Yahoo! Finance Home Page with Python blog post, which describes this in detail.

Code and full example in online IDE:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time  # lxml only needs to be installed, not imported

URL = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'

service = Service(ChromeDriverManager().install())

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--lang=en')
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36')

driver = webdriver.Chrome(service=service, options=options)
driver.get(URL)

old_height = driver.execute_script("""
    function getHeight() {
        return document.querySelector('#Aside').scrollHeight;
    }
    return getHeight();
""")

while True:
    driver.execute_script('window.scrollTo(0, document.querySelector("#Aside").scrollHeight);')

    time.sleep(2)

    new_height = driver.execute_script("""
        function getHeight() {
            return document.querySelector('#Aside').scrollHeight;
        }
        return getHeight();
    """)

    if new_height == old_height:
        break

    old_height = new_height

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

price_targets = soup.select_one('[data-test="price-targets"] div').get('aria-label')

print(price_targets)

Output:

Low  122 Current  129.62 Average  174.62 High  214
Answered By: Artur Chukhrai