'NoneType' object is not subscriptable: bs4 task fails permanently
Question:
update: I tried the scripts of Driftr95 in Google Colab and got some questions – the scripts failed and were not successful. Question: at the beginning of the scripts I noticed that some lines are commented out. Why is this so? I will try to investigate more – meanwhile, thanks a lot, awesome.
Two ideas come to mind:
a. The whole site of a result page contains even more data – see the results of one (of 700) pages:
the digital innovation hub: 4PDIH – Public Private People Partnership Digital Innovation Hub
The dataset with these categories:
Hub Information
Description
Contact Data
Organisation
Technologies
Market and Services
Service Examples
Funding
Customers
Partners
b. Second idea, with the awesome scripts:
NameError("name 'pd' is not defined") from https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool?_eu_europa_ec_jrc_dih_web_DihWebPortlet_cur=1
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-3-c1f39e3c2547> in <module>
11
12 # pd.concat(dfList).to_csv(output_fp, index=False) ## save without page numbers
---> 13 df=pd.concat(dfList, keys=list(range(1,len(dfList)+1)),names=['from_pg','pgi'])
14 df.reset_index().drop('pgi',axis='columns').to_csv(output_fp, index=False)
NameError: name 'pd' is not defined
and besides that – in the next trial
NameError Traceback (most recent call last)
<ipython-input-5-538670405002> in <module>
1 # df = pd.concat(dfList.....
----> 2 orig_cols = list(df.columns)
3 for ocn in orig_cols:
4 if any(vals:=[cv for cv,*_ in df[ocn]]): df[ocn[0]] = vals
5 if any(links:=[c[1] for c in df[ocn]]): df[ocn[0].split()[0]+' Links'] = links
NameError: name 'df' is not defined
and the next trial:
NameError Traceback (most recent call last)
<ipython-input-1-4a00208c3fe6> in <module>
10 pg_num += 1
11 if isinstance(max_pg, int) and pg_num>max_pg: break
---> 12 pgSoup = BeautifulSoup((pgReq:=requests.get(next_link)).content, 'lxml')
13 rows = pgSoup.select('tr:has(td[data-ecl-table-header])')
14 all_rows += [{'from_pg': pg_num, **get_row_dict(r)} for r in rows]
NameError: name 'BeautifulSoup' is not defined
end of the update.
The full story:
I am currently trying to learn Beautiful Soup (bs4), starting with fetching data, using a scraper that works with Beautiful Soup, scrapes the dataset of this page, and puts the data into CSV format or a pandas DataFrame. If I run this in Google Colab, I am facing some weird issues. See below:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Make a request to the webpage
url = 'https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool'
response = requests.get(url)

# Parse the HTML content with Beautiful Soup
soup = BeautifulSoup(response.content, 'html.parser')

# Find the table with the data
table = soup.find('table')

# Extract the table headers
headers = []
for th in table.find_all('th'):
    headers.append(th.text.strip())

# Extract the table rows
rows = []
for tr in table.find_all('tr')[1:]:
    row = []
    for td in tr.find_all('td'):
        row.append(td.text.strip())
    rows.append(row)

# Find the total number of pages
num_pages = soup.find('input', {'id': 'paginationPagesNum'})['value']

# Loop through each page and extract the data
for page in range(2, int(num_pages) + 1):
    # Make a request to the next page
    page_url = f'{url}?page={page}'
    page_response = requests.get(page_url)

    # Parse the HTML content with Beautiful Soup
    page_soup = BeautifulSoup(page_response.content, 'html.parser')

    # Find the table with the data
    page_table = page_soup.find('table')

    # Extract the table rows
    for tr in page_table.find_all('tr')[1:]:
        row = []
        for td in tr.find_all('td'):
            row.append(td.text.strip())
        rows.append(row)

# Create a Pandas DataFrame with the data
df = pd.DataFrame(rows, columns=headers)

# Save the DataFrame to a CSV file
df.to_csv('digital-innovation-hubs.csv', index=False)
See what I am getting back if I run this in Google Colab:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-1-f87e37f02fde> in <module>
27
28 # Find the total number of pages
---> 29 num_pages = soup.find('input', {'id': 'paginationPagesNum'})['value']
30
31 # Loop through each page and extract the data
TypeError: 'NoneType' object is not subscriptable
update:
Thanks to the help of riley.johnson3 I found out that the pagination wrapper selector had to be fixed. Awesome, many thanks for the quick help and the explanation. I have gathered a set of data – it's a sample. Now I have to find out how to get the full set of data: all 700 records, with all the data. I guess we are almost there. Again, many thanks for your outstanding help. This is great and appreciated a lot. 😉
Answers:
The problem is that the id you're using (paginationPagesNum) does not exist in the page. This statement returns None:
soup.find('input', {'id': 'paginationPagesNum'})
You're trying to access the 'value' attribute from a NoneType, which is what causes the error. To fix it, you need to find the right tag. This code sample finds the pagination wrapper, finds the individual elements, and determines their length:
pagination_wrapper = soup.select_one(
'#_eu_europa_ec_jrc_dih_web_DihWebPortlet_list-web-page-iterator'
)
pagination_items = pagination_wrapper.select(
'ul > li:not(.ecl-pagination__item--next)'
)
num_pages = len(pagination_items)
Alternatively, here’s a one-liner to achieve the same thing:
num_pages = len(soup.select('#_eu_europa_ec_jrc_dih_web_DihWebPortlet_list-web-page-iterator > ul > li:not(.ecl-pagination__item--next)'))
Note that the :not(.ecl-pagination__item--next) is required to filter out the next-page button; otherwise, num_pages would be off by 1.
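As a general safeguard: since find returns None when nothing matches, it is safer to check the result before subscripting. A minimal self-contained sketch (the HTML snippet and the fallback of one page are made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<div><input id="somethingElse" value="3"></div>'  # no paginationPagesNum here
soup = BeautifulSoup(html, 'html.parser')

tag = soup.find('input', {'id': 'paginationPagesNum'})   # returns None on a miss
num_pages = int(tag['value']) if tag is not None else 1  # fall back instead of crashing
print(num_pages)  # 1
```

The same pattern applies to any find/select_one call whose target might be missing.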
pandas-only Solution
If you just want the [immediately visible] table data, you can use pandas read_html in a loop until it raises an exception, and then concat to concatenate all the scraped DataFrames together:
# import pandas as pd
output_fp = 'digital-innovation-hubs.csv'
dfList, pg_num, max_pg = [], 0, None
base_url = 'https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool'
while (pg_num:=pg_num+1) and (not isinstance(max_pg,int) or pg_num<max_pg):
    pg_url = f'{base_url}?_eu_europa_ec_jrc_dih_web_DihWebPortlet_cur={pg_num}'
    # try: dfList += pd.read_html(pg_url, extract_links='all')[:1] ## [needs v1.5.0]
    try: dfList += pd.read_html(pg_url)[:1]
    except Exception as e: pg_num, _ = -1, print(f'\n{e!r} from {pg_url}')
    else: print('', end=f'\rScraped {len(dfList[-1])} rows from {pg_url}')
# pd.concat(dfList).to_csv(output_fp, index=False) ## save without page numbers
df=pd.concat(dfList, keys=list(range(1,len(dfList)+1)),names=['from_pg','pgi'])
df.reset_index().drop('pgi',axis='columns').to_csv(output_fp, index=False)
As you can see from the output, the links are not scraped. However, it should be noted that since pandas 1.5.0 you can set an extract_links parameter in read_html; the results will then contain (text, link) tuples, but can be cleaned up with something like:
# df = pd.concat(dfList.....
orig_cols = [c for c in df.columns if c != 'from_pg']
for ocn in orig_cols:
    if any(vals:=[cv for cv,*_ in df[ocn]]): df[ocn[0]] = vals
    if any(links:=[c[1] for c in df[ocn]]): df[ocn[0].split()[0]+' Links'] = links
if 'Email Links' in df.columns:
    df['Email'] = df['Email Links'].str.replace('mailto:','',1)
    df = df.drop('Email Links', axis='columns')
df = df.drop(orig_cols, axis='columns')
# df.....to_csv(output_fp, index=False)
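To illustrate the shape those cells take, here is a toy sketch of the same flattening idea (the data and column names are invented; with extract_links='all', every cell is a (text, href) tuple, with href being None where the cell has no link):

```python
import pandas as pd

# toy frame mimicking cells from read_html(..., extract_links='all')
df = pd.DataFrame({
    'Hub name': [('Hub A', '/hub-a'), ('Hub B', '/hub-b')],
    'Country': [('AT', None), ('BE', None)],
})
for col in list(df.columns):
    links = [href for _, href in df[col]]
    df[col] = [text for text, _ in df[col]]  # keep only the visible text
    if any(links):                           # add a Links column only where links exist
        df[col + ' Links'] = links

print(df.columns.tolist())  # ['Hub name', 'Country', 'Hub name Links']
```

This keeps every text column and adds a Links column only for columns that actually contained anchors.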
requests+bs4 Solution
The function below (view outputs for the first page) should extract all the data from a single table row (tr tag):
def get_row_dict(trTag):
    row = {td['data-ecl-table-header']: td.get_text(' ', strip=True)
           for td in trTag.select('td[data-ecl-table-header]')}
    for td in trTag.select('td[data-ecl-table-header]:has(a[href])'):
        k, link = td['data-ecl-table-header'], td.find('a',href=True)['href']
        if k=='Email' and link.startswith('mailto:'):
            link = link.replace('mailto:', '', 1)
        row[(k.split()[0]+' Link') if row[k] else k] = link
    return row
My preferred approach when scraping paginated data is to use a while
loop conditional on there being a next link.
# import requests
# import pandas as pd
# from bs4 import BeautifulSoup
# def get_row_dict...
output_fp = 'digital-innovation-hubs.csv'
all_rows, pg_num, max_pg = [], 0, None
next_link = 'https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool'
while next_link:
    pg_num += 1
    if isinstance(max_pg, int) and pg_num>max_pg: break
    pgSoup = BeautifulSoup((pgReq:=requests.get(next_link)).content, 'lxml')
    rows = pgSoup.select('tr:has(td[data-ecl-table-header])')
    all_rows += [{'from_pg': pg_num, **get_row_dict(r)} for r in rows]
    # all_rows += [get_row_dict(r) for r in rows] # no "from_pg" column

    ## just for printing ##
    pgNum = pgSoup.find('span', {'aria-current':"true", 'aria-label':True})
    if pgNum: pgNum = ['',*pgNum.get_text(' ', strip=True).split()][-1]
    from_pg = int(pgNum) if isinstance(pgNum,str) and pgNum.isdigit() else pg_num
    rowCt = pgSoup.find('div', class_='ecl-u-type-prolonged-s')
    rowCt = rowCt.text.split(':')[-1].strip() if rowCt else 'UNKNOWN'
    vStr = f'{len(rows)} scraped [total: {len(all_rows)} of {rowCt}] - '
    vStr += f'<{pgReq.status_code} {pgReq.reason}> from {pgReq.url}'
    print(f'\r[{pg_num}][{pgNum}] {vStr}', end='')

    next_link = pgSoup.find('a', {'href':True, 'aria-label':'Go to next page'})
    if next_link: next_link = next_link['href']
pd.DataFrame(all_rows).to_csv(output_fp, index=False)
ADDED EDIT: Scraping Hub Links
import requests
import pandas as pd
from bs4 import BeautifulSoup
def get_hub_section(h2_tag):
    sibs = h2_tag.find_next_siblings(True)  ## section contents
    hTxt = ' '.join(h2_tag.get_text(' ').split())  ## header text - minimize space
    if not sibs: return hTxt, None
    if len(sibs)==1 and sibs[0].name in ['ul','ol']:  ## list
        l_items = [li.get_text(' ',strip=True) for li in sibs[0].select('li')]
        return hTxt, l_items
    if not (len(sibs)==1 and sibs[0].name=='dl'):  ## paragraphs [default]
        p_list = [p.get_text(' ') for p in sibs if p.text.strip()]
        return hTxt, '\n'.join([' '.join(p.split()) for p in p_list])
    sub_sections = {}  ## section will be further parsed to a dictionary
    for d in sibs[0].select('dt:has(+dd)'):
        k = ' '.join(d.get_text(' ').split())  ## subheader - minimize whitespace
        ss_text = d.find_next('dd').get_text(' ',strip=True)
        links = [a['href'] for a in d.find_next('dd').find_all('a', href=True)]
        if len(links)==1: links = links[0]
        if links: sub_sections[f'{k} Links'] = links
        if ss_text and ss_text!=links: sub_sections[k] = ss_text
    return hTxt, sub_sections

def get_hub_info(hub_link, pre_print='', post_print=''):
    hSoup = BeautifulSoup((hReq := requests.get(hub_link)).content, 'lxml')
    hub_info = [get_hub_section(h2) for h2 in hSoup.find_all('h2', {'id':True})]
    rStatus = f'<{hReq.status_code} {hReq.reason}> from {hReq.url}'
    print(f'\r{pre_print}{len(hub_info)} sections {rStatus}', end=post_print)
    return {'Hub Link': hub_link, **{k: v for k, v in hub_info}}

output_fp = 'digital-innovation-hubs.csv'  ## from previous output
hubs_opfp = 'DIH-hubs-detailed.csv'  ## can be the same as output_fp to replace it

### GET LIST OF LINKS ###
# hub_links = set(df['Hub Link'])  ## if you're scraping for both outputs at once
hub_links = set(pd.read_csv(output_fp)['Hub Link'])  ## if you already have output_fp

## SCRAPE AND SAVE ##
llen = len(hub_links)
data = [get_hub_info(l, f'[{i} of {llen}] ') for i, l in enumerate(hub_links,1)]
pd.json_normalize(data, sep=' → ').to_csv(hubs_opfp, index=False)
# pd.DataFrame(data).to_csv(hubs_opfp, index=False)  ## doesn't expand subheaders
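A quick illustration of why json_normalize is used at the end rather than a plain DataFrame: the per-section dictionaries returned by get_hub_section are nested, and json_normalize expands each subheader into its own column, joining the keys with sep (toy data with hypothetical hub rows):

```python
import pandas as pd

data = [{'Hub Link': 'https://example.com/hub-a',
         'Contact Data': {'Email': 'a@example.com', 'City': 'Vienna'}},
        {'Hub Link': 'https://example.com/hub-b',
         'Contact Data': {'Email': 'b@example.com', 'City': 'Ghent'}}]

flat = pd.json_normalize(data, sep=' → ')  # nested keys become 'Section → Subheader'
print(flat.columns.tolist())
# ['Hub Link', 'Contact Data → Email', 'Contact Data → City']
```

With pd.DataFrame(data) instead, the 'Contact Data' column would just hold the raw dictionaries.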