Problem getting xlsx tables from a website with BeautifulSoup
Question:
I'm new to Python and trying to find the best way to get the xlsx tables for fiscal years 2010 through 2022 from https://www.sba.gov/document/report-sba-disaster-loan-data. Then I want to combine those tables into a single dataframe with a "year" column added indicating the fiscal year of the data. Thank you in advance!
I've tried to get all the a-href links, but it gave me the TypeError below:
import requests
from bs4 import BeautifulSoup

web_url = "https://www.sba.gov/document/report-sba-disaster-loan-data"
html = requests.get(web_url).content
soup = BeautifulSoup(html, 'html.parser')

tables = soup.find('div', {'class': 'jHSEzIJePQkATFwBbUD8j'})
for table in tables:
    link = cols[1].find('a').get('href')
    print(link)

TypeError: 'NoneType' object is not iterable
Answers:
The file URLs are loaded from an external address via JavaScript, so they are not present in the static HTML that requests downloads. That is why soup.find() returns None, and iterating over None raises the TypeError you saw. To get the .xlsx URLs you can instead query the site's JSON API, as in this example:
import re
import requests

url = 'https://www.sba.gov/document/report-sba-disaster-loan-data'
api_url = 'https://www.sba.gov/api/content/{node_id}.json'

# The page embeds its content-node id in an inline script; extract it
html_doc = requests.get(url).text
node_id = re.search(r'nodeId = "(\d+)"', html_doc).group(1)

# The JSON API returns the file list the JavaScript would have rendered
data = requests.get(api_url.format(node_id=node_id)).json()
for f in data['files']:
    print(f['effectiveDate'], 'https://www.sba.gov' + f['fileUrl'])
Prints:
2022-02-11 https://www.sba.gov/sites/default/files/2022-07/SBA_Disaster_Loan_Data_FY21.xlsx
2021-03-15 https://www.sba.gov/sites/default/files/2022-07/SBA_Disaster_Loan_Data_FY20.xlsx
2020-04-10 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY19.xlsx
2019-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY18.xlsx
2018-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY17_Update_033118.xlsx
2017-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY16.xlsx
2016-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY15.xlsx
2015-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY14.xlsx
2014-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY13.xlsx
2014-09-23 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_Superstorm_Sandy.xlsx
2013-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY12.xlsx
2012-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY11.xlsx
2011-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY10.xlsx
2010-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY09.xlsx
2009-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY08.xlsx
2008-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY07.xlsx
2007-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY06.xlsx
2006-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY05.xlsx
2005-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY04.xlsx
2004-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY03.xls
2003-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY02.xls
2002-10-01 https://www.sba.gov/sites/default/files/2020-06/SBA_Disaster_Loan_Data_FY01.xls
2001-10-01 https://www.sba.gov/sites/default/files/2021-05/SBA_Disaster_Loan_Data_FY00.xlsx
To get a pandas dataframe you can collect the URLs into a list and load each file with, e.g., the pandas.read_excel function.
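Building on the URL list above, here is one hedged sketch of the combining step. The helper names (fiscal_year, build_dataframe) are my own, not from the site, and the sketch assumes the fiscal year can be parsed from the two-digit FY suffix in each file name (the Superstorm Sandy file has no such suffix and is skipped):

```python
import re
import pandas as pd

def fiscal_year(url):
    """Extract the fiscal year from a file name like
    SBA_Disaster_Loan_Data_FY21.xlsx (assumes a two-digit FY suffix)."""
    m = re.search(r'FY(\d{2})', url)
    return 2000 + int(m.group(1)) if m else None

def build_dataframe(urls, start=2010, end=2022):
    """Download each workbook in the requested range and stack them
    into one dataframe with a 'year' column."""
    frames = []
    for url in urls:
        year = fiscal_year(url)
        if year is None or not start <= year <= end:
            continue  # outside the requested range, or no FY in the name
        df = pd.read_excel(url)  # pandas can read an .xlsx directly from a URL
        df['year'] = year        # tag every row with its fiscal year
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```

With the URLs printed by the loop above collected into a list, combined = build_dataframe(urls) should yield the single dataframe the question asks for, assuming the yearly workbooks share the same column layout (if they don't, pd.concat will still run but will fill mismatched columns with NaN).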