How to scrape content from a complex div class using Beautiful Soup
Question:
I’m doing some exercises to practice web scraping with Python, and I’d like to get the values of the first row ("Total Revenue") of the table on this Yahoo page (Bank of America at NYSE).
Looking at the page source, my idea is to find the first occurrence of <div class="" data-test="fin-row">
and get the values, but I’m not sure how to navigate inside that first div.
Below is the HTML for the first row:
<div class="" data-test="fin-row">
<div class="D(tbr) fi-row Bgc($hoverBgColor):h">
<div class="D(tbc) Ta(start) Pend(15px)--mv2 Pend(10px) Bxz(bb) Py(8px) Bdends(s) Bdbs(s) Bdstarts(s) Bdstartw(1px) Bdbw(1px) Bdendw(1px) Bdc($seperatorColor) Pos(st) Start(0) Bgc($lv2BgColor) fi-row:h_Bgc($hoverBgColor) Pstart(15px)--mv2 Pstart(10px)">
<div class="D(ib) Va(m) Ell Mt(-3px) W(215px)--mv2 W(200px) undefined" title="Total Revenue">
<button aria-label="Total Revenue" class="P(0) M(0) Va(m) Bd(0) Fz(s) Mend(2px) tgglBtn">
<svg class="H(16px) Fill($primaryColor) Stk($primaryColor) tgglBtn:h_Fill($linkColor) tgglBtn:h_Stk($linkColor) Cur(p)" width="16" style="stroke-width:0;vertical-align:bottom" height="16" viewBox="0 0 48 48" data-icon="caret-right">
<path d="M33.447 24.102L20.72 11.375c-.78-.78-2.048-.78-2.828 0-.78.78-.78 2.047 0 2.828l9.9 9.9-9.9 9.9c-.78.78-.78 2.047 0 2.827.78.78 2.047.78 2.828 0l12.727-12.728z"></path>
</svg>
</button>
<span class="Va(m)">Total Revenue</span>
</div>
<div class="W(3px) Pos(a) Start(100%) T(0) H(100%) Bg($pfColumnFakeShadowGradient) Pe(n) Pend(5px)"></div>
</div>
<div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>90,742,000</span></div>
<div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>89,113,000</span></div>
<div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>85,528,000</span></div>
<div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg D(tbc)" data-test="fin-col"><span>91,244,000</span></div>
<div class="Ta(c) Py(6px) Bxz(bb) BdB Bdc($seperatorColor) Miw(120px) Miw(100px)--pnclg Bgc($lv1BgColor) fi-row:h_Bgc($hoverBgColor) D(tbc)" data-test="fin-col"><span>91,247,000</span></div>
</div>
<div></div>
In my code I’m using Selenium to process the page. I’m not sure it’s the best way, but with other libraries like urlopen I wasn’t able to see the HTML content. I’m able to open the page and click the accept button, but after that I’m not sure how to navigate inside the first div. I’m actually getting this error: "AttributeError: 'NoneType' object has no attribute 'get_text'"
import requests
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
url = "https://finance.yahoo.com/quote/BAC/financials?p=BAC"
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
# Click accept button
aceitar = driver.find_element(By.NAME, "agree")
aceitar.click()
# Find the div of the Revenue row <div class="" data-test="fin-row">
primeiraLinha = soup.find("div", {"class":""})
print(primeiraLinha.get_text())
BTW, I think Selenium makes this process very slow.
Answers:
Here is a Selenium solution that loads the entire table into a pandas DataFrame.
Imports required
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
Start web driver
# Replace with your ChromeDriver path
chrome_path = r"C:\Users\hpoddar\Desktop\Tools\chromedriver_win32\chromedriver.exe"
s = Service(chrome_path)
driver = webdriver.Chrome(service=s)
Fetch the page
driver.get('https://finance.yahoo.com/quote/BAC/financials?p=BAC')
Wait for table to load
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '//div[@class="D(tbhg)"]')))
Get the header row
headers_elem = driver.find_elements(By.XPATH, '//div[@class="D(tbhg)"]/div/div')
col_headers = [header.text for header in headers_elem]
df = pd.DataFrame(columns = col_headers)
df
Output:
Empty DataFrame
Columns: [Breakdown, TTM, 12/30/2021, 12/30/2020, 12/30/2019, 12/30/2018]
Index: []
Get the rows from the table
Each row of the table is stored in rows:
rows = driver.find_elements(By.XPATH, '//div[@class="D(tbrg)"]//div[@data-test="fin-row"]')
for row in rows:
    row_values = row.find_elements(By.XPATH, 'div/div')
    df.loc[len(df)] = [row_value.text for row_value in row_values]
Which gives us the expected output:

| | Breakdown | TTM | 12/30/2021 | 12/30/2020 | 12/30/2019 | 12/30/2018 |
|---|---|---|---|---|---|---|
| 0 | Total Revenue | 90,742,000 | 89,113,000 | 85,528,000 | 91,244,000 | 91,247,000 |
| 1 | Credit Losses Provision | 560,000 | 4,594,000 | -11,320,000 | -3,590,000 | -3,282,000 |
| 2 | Non Interest Expense | 59,763,000 | 59,731,000 | 55,213,000 | 54,900,000 | 53,381,000 |
| 3 | Special Income Charges | – | – | – | – | 0 |
| 4 | Pretax Income | 31,539,000 | 33,976,000 | 18,995,000 | 32,754,000 | 34,584,000 |
| 5 | Tax Provision | 3,521,000 | 1,998,000 | 1,101,000 | 5,324,000 | 6,437,000 |
| 6 | Net Income Common Stockholders | 26,565,000 | 30,557,000 | 16,473,000 | 25,998,000 | 26,696,000 |
| 7 | Diluted NI Available to Com Stockholders | 26,565,000 | 30,557,000 | 16,473,000 | 25,998,000 | 26,696,000 |
| 8 | Basic EPS | – | 3.60 | 1.88 | 2.77 | 2.64 |
| 9 | Diluted EPS | – | 3.57 | 1.87 | 2.75 | 2.61 |
| 10 | Basic Average Shares | – | 8,493,300 | 8,753,200 | 9,390,500 | 10,096,500 |
| 11 | Diluted Average Shares | – | 8,558,400 | 8,796,900 | 9,442,900 | 10,236,900 |
| 12 | INTEREST_INCOME_AFTER_PROVISION_FOR_LOAN_LOSS | 47,080,000 | 47,528,000 | 32,040,000 | 45,301,000 | 44,150,000 |
| 13 | Net Income from Continuing & Discontinued Operation | 28,018,000 | 31,978,000 | 17,894,000 | 27,430,000 | 28,147,000 |
| 14 | Normalized Income | 28,018,000 | 31,978,000 | 17,894,000 | 27,430,000 | 28,147,000 |
| 15 | Total Money Market Investments | 348,000 | -90,000 | 903,000 | 4,843,000 | 3,176,000 |
| 16 | Reconciled Depreciation | 1,953,000 | 1,898,000 | 1,843,000 | 1,729,000 | 2,063,000 |
| 17 | Net Income from Continuing Operation Net Minority Interest | 28,018,000 | 31,978,000 | 17,894,000 | 27,430,000 | 28,147,000 |
| 18 | Total Unusual Items Excluding Goodwill | – | – | – | – | 0 |
| 19 | Total Unusual Items | – | – | – | – | 0 |
| 20 | Tax Rate for Calcs | 0 | 0 | 0 | 0 | 0 |
| 21 | Tax Effect of Unusual Items | 0 | 0 | 0 | 0 | 0 |
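Since the original question asked about navigating the row with Beautiful Soup, here is a minimal sketch of that approach, run against a trimmed version of the HTML snippet from the question (in a full script you would feed it `driver.page_source` instead). Matching on the `data-test` attribute is more reliable than matching on the empty `class`, which is why `soup.find("div", {"class":""})` in the question hit an unrelated div and raised the `AttributeError`:

```python
from bs4 import BeautifulSoup

# Trimmed version of the first-row HTML shown in the question
html = '''
<div class="" data-test="fin-row">
  <div class="D(tbr) fi-row Bgc($hoverBgColor):h">
    <div class="D(tbc) Ta(start)">
      <span class="Va(m)">Total Revenue</span>
    </div>
    <div data-test="fin-col"><span>90,742,000</span></div>
    <div data-test="fin-col"><span>89,113,000</span></div>
    <div data-test="fin-col"><span>85,528,000</span></div>
    <div data-test="fin-col"><span>91,244,000</span></div>
    <div data-test="fin-col"><span>91,247,000</span></div>
  </div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Match on the data-test attribute instead of the empty class
first_row = soup.find("div", {"data-test": "fin-row"})

# The row label is the first <span>; the numbers sit in the fin-col divs
label = first_row.find("span").get_text()
values = [col.get_text()
          for col in first_row.find_all("div", {"data-test": "fin-col"})]

print(label, values)
```

Note that this only works if the HTML actually contains the rendered table, which is why the Selenium-rendered page source is needed rather than a plain `urlopen` fetch.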
TL;DR
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
chrome_path = r"C:\Users\hpoddar\Desktop\Tools\chromedriver_win32\chromedriver.exe"
s = Service(chrome_path)
driver = webdriver.Chrome(service=s)
driver.get('https://finance.yahoo.com/quote/BAC/financials?p=BAC')
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '//div[@class="D(tbhg)"]')))
headers_elem = driver.find_elements(By.XPATH, '//div[@class="D(tbhg)"]/div/div')
col_headers = [header.text for header in headers_elem]
df = pd.DataFrame(columns = col_headers)
rows = driver.find_elements(By.XPATH, '//div[@class="D(tbrg)"]//div[@data-test="fin-row"]')
for row in rows:
    row_values = row.find_elements(By.XPATH, 'div/div')
    df.loc[len(df)] = [row_value.text for row_value in row_values]
The result is stored in df.
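Note that the scraped cells arrive as strings with thousands separators, and missing values appear as an en dash (`–`). As a small, hypothetical post-processing step (the tiny sample DataFrame below just mirrors the scraped layout), you could convert the value columns to numbers:

```python
import pandas as pd

# Hypothetical sample in the same shape as the scraped table:
# strings with commas, '–' for missing values
df = pd.DataFrame({
    "Breakdown": ["Total Revenue", "Basic EPS"],
    "TTM": ["90,742,000", "–"],
    "12/30/2021": ["89,113,000", "3.60"],
})

value_cols = [c for c in df.columns if c != "Breakdown"]
for col in value_cols:
    # Strip thousands separators; errors="coerce" turns '–' into NaN
    df[col] = pd.to_numeric(
        df[col].str.replace(",", "", regex=False),
        errors="coerce",
    )

print(df.dtypes)
```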