How to scrape content from a complex div class using Beautiful Soup

Question:

I’m working through some exercises to practice web scraping with Python, and I would like to get the values of the first row ("Total Revenue") of the table on this Yahoo page (Bank of America on the NYSE).

Looking at the page source, my idea is to find the first occurrence of <div class="" data-test="fin-row"> and read its values, but I’m not sure how to navigate inside that div.

Below is the HTML for the first row:

<div class="" data-test="fin-row">
    <div class="D(tbr) fi-row Bgc($hoverBgColor):h">
        <div class="D(tbc) Ta(start) Pend(15px)--mv2 Pend(10px) Bxz(bb) Py(8px) Bdends(s) Bdbs(s) Bdstarts(s) Bdstartw(1px) Bdbw(1px) Bdendw(1px) Bdc($seperatorColor) Pos(st) Start(0) Bgc($lv2BgColor) fi-row:h_Bgc($hoverBgColor) Pstart(15px)--mv2 Pstart(10px)">
            <div class="D(ib) Va(m) Ell Mt(-3px) W(215px)--mv2 W(200px) undefined" title="Total Revenue">
                <button aria-label="Total Revenue" class="P(0) M(0) Va(m) Bd(0) Fz(s) Mend(2px) tgglBtn">
                    <svg class="H(16px) Fill($primaryColor) Stk($primaryColor) tgglBtn:h_Fill($linkColor) tgglBtn:h_Stk($linkColor) Cur(p)" width="16" style="stroke-width:0;vertical-align:bottom" height="16" viewBox="0 0 48 48" data-icon="caret-right">
                        <path d="M33.447 24.102L20.72 11.375c-.78-.78-2.048-.78-2.828 0-.78.78-.78 2.047 0 2.828l9.9 9.9-9.9 9.9c-.78.78-.78 2.047 0 2.827.78.78 2.047.78 2.828 0l12.727-12.728z"></path>
                    </svg>
                </button>
                <span class="Va(m)">Total Revenue</span>
            </div>
            <div class="W(3px) Pos(a) Start(100%) T(0) H(100%) Bg($pfColumnFakeShadowGradient) Pe(n) Pend(5px)"></div>
        </div>
        <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>90,742,000</span></div>
        <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>89,113,000</span></div>
        <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>85,528,000</span></div>
        <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>91,244,000</span></div>
        <div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>91,247,000</span></div>
    </div>

    <div></div>
</div>

In my code I’m using Selenium to process the page. I’m not sure it’s the best way, but with other libraries like urlopen I wasn’t able to see the HTML content. I can open the page and click the accept button, but after that I don’t know how to navigate inside the first div. I’m getting this error: "AttributeError: 'NoneType' object has no attribute 'get_text'"

import requests
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
url = "https://finance.yahoo.com/quote/BAC/financials?p=BAC"
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")

# Click accept button
aceitar = driver.find_element(By.NAME, "agree")
aceitar.click()

# Find the div of the Revenue row <div class="" data-test="fin-row">
primeiraLinha = soup.find("div", {"class":""})
print(primeiraLinha.get_text())
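(A likely cause of the AttributeError: page_source is captured before the accept button is clicked, so the financials markup may not be in the soup yet, and find("div", {"class": ""}) can return None. Matching on the stable data-test attribute is more reliable. A minimal sketch against a trimmed copy of the snippet above:)

```python
from bs4 import BeautifulSoup

# Trimmed copy of the "Total Revenue" row from the question's HTML.
html = '''
<div class="" data-test="fin-row">
  <div class="D(tbr) fi-row">
    <div class="D(tbc)"><span class="Va(m)">Total Revenue</span></div>
    <div data-test="fin-col"><span>90,742,000</span></div>
    <div data-test="fin-col"><span>89,113,000</span></div>
  </div>
</div>'''

soup = BeautifulSoup(html, "html.parser")

# Match on the stable data-test attribute instead of the empty class.
primeira_linha = soup.find("div", {"data-test": "fin-row"})
print(primeira_linha.get_text(" ", strip=True))
```

On the real page you would re-read driver.page_source after clicking accept, then run the same find on it.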

By the way, I think Selenium makes this process very slow.

Asked By: rcmv


Answers:

Here is a Selenium solution that loads the entire table into a pandas DataFrame.

Imports required

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

Start web driver

# Replace with the path to your ChromeDriver executable
chrome_path = r"C:\Users\hpoddar\Desktop\Tools\chromedriver_win32\chromedriver.exe"
s = Service(chrome_path)
driver = webdriver.Chrome(service=s)

Fetch the page

driver.get('https://finance.yahoo.com/quote/BAC/financials?p=BAC')

Wait for table to load

WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '//div[@class="D(tbhg)"]')))

Get the header row

headers_elem = driver.find_elements(By.XPATH, '//div[@class="D(tbhg)"]/div/div')
col_headers = [header.text for header in headers_elem]
df = pd.DataFrame(columns = col_headers)
df

Output:

Empty DataFrame
Columns: [Breakdown, TTM, 12/30/2021, 12/30/2020, 12/30/2019, 12/30/2018]
Index: []
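The DataFrame idiom used here can be checked offline: pd.DataFrame(columns=...) builds an empty frame from the scraped headers, and df.loc[len(df)] = [...] appends a row at the next integer position (a sketch with a shortened header list):

```python
import pandas as pd

# Shortened version of the scraped header row.
col_headers = ["Breakdown", "TTM", "12/30/2021"]
df = pd.DataFrame(columns=col_headers)

# df.loc[len(df)] = [...] appends one row at the next integer index.
df.loc[len(df)] = ["Total Revenue", "90,742,000", "89,113,000"]
```

The list assigned to df.loc[len(df)] must have exactly one value per column, or pandas raises a ValueError.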

Get the rows from the table

Each row of the table is collected in rows:

rows = driver.find_elements(By.XPATH, '//div[@class="D(tbrg)"]//div[@data-test="fin-row"]')
for row in rows:
    row_values = row.find_elements(By.XPATH, 'div/div')
    df.loc[len(df)] = [row_value.text for row_value in row_values]

Which gives us the expected output:

Breakdown TTM 12/30/2021 12/30/2020 12/30/2019 12/30/2018
0 Total Revenue 90,742,000 89,113,000 85,528,000 91,244,000 91,247,000
1 Credit Losses Provision 560,000 4,594,000 -11,320,000 -3,590,000 -3,282,000
2 Non Interest Expense 59,763,000 59,731,000 55,213,000 54,900,000 53,381,000
3 Special Income Charges 0
4 Pretax Income 31,539,000 33,976,000 18,995,000 32,754,000 34,584,000
5 Tax Provision 3,521,000 1,998,000 1,101,000 5,324,000 6,437,000
6 Net Income Common Stockholders 26,565,000 30,557,000 16,473,000 25,998,000 26,696,000
7 Diluted NI Available to Com Stockholders 26,565,000 30,557,000 16,473,000 25,998,000 26,696,000
8 Basic EPS 3.60 1.88 2.77 2.64
9 Diluted EPS 3.57 1.87 2.75 2.61
10 Basic Average Shares 8,493,300 8,753,200 9,390,500 10,096,500
11 Diluted Average Shares 8,558,400 8,796,900 9,442,900 10,236,900
12 INTEREST_INCOME_AFTER_PROVISION_FOR_LOAN_LOSS 47,080,000 47,528,000 32,040,000 45,301,000 44,150,000
13 Net Income from Continuing & Discontinued Operation 28,018,000 31,978,000 17,894,000 27,430,000 28,147,000
14 Normalized Income 28,018,000 31,978,000 17,894,000 27,430,000 28,147,000
15 Total Money Market Investments 348,000 -90,000 903,000 4,843,000 3,176,000
16 Reconciled Depreciation 1,953,000 1,898,000 1,843,000 1,729,000 2,063,000
17 Net Income from Continuing Operation Net Minority Interest 28,018,000 31,978,000 17,894,000 27,430,000 28,147,000
18 Total Unusual Items Excluding Goodwill 0
19 Total Unusual Items 0
20 Tax Rate for Calcs 0 0 0 0 0
21 Tax Effect of Unusual Items 0 0 0 0 0
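The row loop above can also be exercised offline against static markup, swapping Selenium's find_elements for BeautifulSoup (a sketch: the markup is trimmed to two rows and two columns, using values from the table above):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Two trimmed fin-rows, mirroring the page's div structure.
html = '''
<div data-test="fin-row"><div class="D(tbr)">
  <div><span>Total Revenue</span></div>
  <div data-test="fin-col"><span>90,742,000</span></div>
</div></div>
<div data-test="fin-row"><div class="D(tbr)">
  <div><span>Pretax Income</span></div>
  <div data-test="fin-col"><span>31,539,000</span></div>
</div></div>'''

soup = BeautifulSoup(html, "html.parser")
df = pd.DataFrame(columns=["Breakdown", "TTM"])
for row in soup.find_all("div", {"data-test": "fin-row"}):
    inner = row.find("div")  # the D(tbr) wrapper, like XPath 'div'
    # Direct children of the wrapper are the columns, like XPath 'div/div'
    cells = inner.find_all("div", recursive=False)
    df.loc[len(df)] = [cell.get_text(strip=True) for cell in cells]
```

This is the BeautifulSoup counterpart of row.find_elements(By.XPATH, 'div/div') in the answer's loop.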

TL;DR

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

chrome_path = r"C:\Users\hpoddar\Desktop\Tools\chromedriver_win32\chromedriver.exe"
s = Service(chrome_path)
driver = webdriver.Chrome(service=s)

driver.get('https://finance.yahoo.com/quote/BAC/financials?p=BAC')

WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '//div[@class="D(tbhg)"]')))

headers_elem = driver.find_elements(By.XPATH, '//div[@class="D(tbhg)"]/div/div')
col_headers = [header.text for header in headers_elem]
df = pd.DataFrame(columns = col_headers)

rows = driver.find_elements(By.XPATH, '//div[@class="D(tbrg)"]//div[@data-test="fin-row"]')
for row in rows:
    row_values = row.find_elements(By.XPATH, 'div/div')
    df.loc[len(df)] = [row_value.text for row_value in row_values]

The result is stored in df.
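Note that the scraped values are strings with thousands separators (and blanks for empty cells). If you want numeric columns, one possible cleanup step (not part of the answer above) is to strip the commas and coerce:

```python
import pandas as pd

# Sample column as scraped: comma-separated strings, "" for empty cells.
col = pd.Series(["90,742,000", "89,113,000", ""])

# Strip thousands separators, then coerce; empty cells become NaN.
nums = pd.to_numeric(col.str.replace(",", ""), errors="coerce")
```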

Answered By: Himanshuman