'NoneType' object is not subscriptable: bs4 task fails permanently

Question:

Update: I tried the scripts from Driftr95 in Google Colab and ran into some questions because the scripts failed for me. At the beginning of the scripts I noticed that some lines are commented out. Why is that? I will investigate further; meanwhile, thanks a lot, this is awesome.

Two ideas come to mind:

a. the full page behind each result contains even more data. See, for example, one of the 700 result pages:

the digital innovation hub: 4PDIH – Public Private People Partnership Digital Innovation Hub

https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/17265/view?_eu_europa_ec_jrc_dih_web_DihWebPortlet_backUrl=%2Fdigital-innovation-hubs-tool

The dataset on that page has these categories:

Hub Information
Description
Contact Data
Organisation
Technologies
Market and Services
Service Examples
Funding
Customers
Partners

[screenshot: sections of the hub detail page]

b. the second idea concerns the awesome scripts themselves. Running them I get:

NameError("name 'pd' is not defined") from https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool?_eu_europa_ec_jrc_dih_web_DihWebPortlet_cur=1
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-3-c1f39e3c2547> in <module>
     11 
     12 # pd.concat(dfList).to_csv(output_fp, index=False) ## save without page numbers
---> 13 df=pd.concat(dfList, keys=list(range(1,len(dfList)+1)),names=['from_pg','pgi'])
     14 df.reset_index().drop('pgi',axis='columns').to_csv(output_fp, index=False)

NameError: name 'pd' is not defined

and besides that, in the next trial:

NameError                                 Traceback (most recent call last)
<ipython-input-5-538670405002> in <module>
      1 # df = pd.concat(dfList.....
----> 2 orig_cols = list(df.columns)
      3 for ocn in orig_cols:
      4     if any(vals:=[cv for cv,*_ in df[ocn]]): df[ocn[0]] = vals
      5     if any(links:=[c[1] for c in df[ocn]]): df[ocn[0].split()[0]+' Links'] = links

NameError: name 'df' is not defined

and in the next trial:

NameError                                 Traceback (most recent call last)
<ipython-input-1-4a00208c3fe6> in <module>
     10     pg_num += 1
     11     if isinstance(max_pg, int) and pg_num>max_pg: break
---> 12     pgSoup = BeautifulSoup((pgReq:=requests.get(next_link)).content, 'lxml')
     13     rows = pgSoup.select('tr:has(td[data-ecl-table-header])')
     14     all_rows += [{'from_pg': pg_num, **get_row_dict(r)} for r in rows]

NameError: name 'BeautifulSoup' is not defined

End of the update.

The full story:

I am currently trying to learn Beautiful Soup (bs4), starting with fetching data.

I want a scraper that uses Beautiful Soup to scrape the dataset from this page and put the data into CSV format (or into pandas). When I run the code below in Google Colab, I am facing some weird issues. See below:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Make a request to the webpage
url = 'https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool'
response = requests.get(url)

# Parse the HTML content with Beautiful Soup
soup = BeautifulSoup(response.content, 'html.parser')

# Find the table with the data
table = soup.find('table')

# Extract the table headers
headers = []
for th in table.find_all('th'):
    headers.append(th.text.strip())

# Extract the table rows
rows = []
for tr in table.find_all('tr')[1:]:
    row = []
    for td in tr.find_all('td'):
        row.append(td.text.strip())
    rows.append(row)

# Find the total number of pages
num_pages = soup.find('input', {'id': 'paginationPagesNum'})['value']

# Loop through each page and extract the data
for page in range(2, int(num_pages) + 1):
    # Make a request to the next page
    page_url = f'{url}?page={page}'
    page_response = requests.get(page_url)

    # Parse the HTML content with Beautiful Soup
    page_soup = BeautifulSoup(page_response.content, 'html.parser')

    # Find the table with the data
    page_table = page_soup.find('table')

    # Extract the table rows
    for tr in page_table.find_all('tr')[1:]:
        row = []
        for td in tr.find_all('td'):
            row.append(td.text.strip())
        rows.append(row)

# Create a Pandas DataFrame with the data
df = pd.DataFrame(rows, columns=headers)

# Save the DataFrame to a CSV file
df.to_csv('digital-innovation-hubs.csv', index=False)

This is what I get back when I run it in Google Colab:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-f87e37f02fde> in <module>
     27 
     28 # Find the total number of pages
---> 29 num_pages = soup.find('input', {'id': 'paginationPagesNum'})['value']
     30 
     31 # Loop through each page and extract the data

TypeError: 'NoneType' object is not subscriptable

Update: this is what I got back:

[screenshot: sample of the scraped data]

Thanks to the help of riley.johnson3, I found out that the pagination lookup had to be fixed to use the pagination wrapper.

  • Awesome, many thanks for the quick help and the explanation. I have gathered a set of data, but it is only a sample; I now have to find out how to get the full set of data, all 700 records with all their fields. I guess we are almost there. Again, many thanks for your outstanding help; this is great and much appreciated. 😉
Asked By: malaga


Answers:

The problem is that the id you’re using (paginationPagesNum) does not exist in the page. This statement returns None:

soup.find('input', {'id': 'paginationPagesNum'})

You’re trying to access the 'value' attribute from a NoneType, which is what causes the error. To fix it, you need to find the right tag. This code sample finds the pagination wrapper, finds the individual elements, and determines their length:

pagination_wrapper = soup.select_one(
    '#_eu_europa_ec_jrc_dih_web_DihWebPortlet_list-web-page-iterator'
)

pagination_items = pagination_wrapper.select(
    'ul > li:not(.ecl-pagination__item--next)'
)

num_pages = len(pagination_items)

Alternatively, here’s a one-liner to achieve the same thing:

num_pages = len(soup.select('#_eu_europa_ec_jrc_dih_web_DihWebPortlet_list-web-page-iterator > ul > li:not(.ecl-pagination__item--next)'))

Note that the :not(.ecl-pagination__item--next) is required to filter out the next page button; otherwise, num_pages would be off by 1.
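
For completeness, here is a minimal sketch of how that count could be dropped into the original script in place of the failing paginationPagesNum lookup. It assumes the soup object from the question and the wrapper id shown above, and falls back to a single page if the wrapper cannot be found:

pagination_wrapper = soup.select_one(
    '#_eu_europa_ec_jrc_dih_web_DihWebPortlet_list-web-page-iterator'
)
if pagination_wrapper is None:
    num_pages = 1  # no pagination wrapper found, so only scrape the first page
else:
    num_pages = len(pagination_wrapper.select(
        'ul > li:not(.ecl-pagination__item--next)'
    ))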

Answered By: riley.johnson3

pandas-only Solution

If you just want the [immediately visible] table data, you can just use pandas read_html in a loop until it raises an exception and then use concat to concatenate all the scraped DataFrames together:

# import pandas as pd
output_fp = 'digital-innovation-hubs.csv'
dfList, pg_num, max_pg = [], 0, None
base_url = 'https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool'
while (pg_num:=pg_num+1) and (not isinstance(max_pg,int) or pg_num<max_pg):
    pg_url = f'{base_url}?_eu_europa_ec_jrc_dih_web_DihWebPortlet_cur={pg_num}'
    # try: dfList += pd.read_html(pg_url, extract_links='all')[:1] ## [needs v1.5.0.]
    try: dfList += pd.read_html(pg_url)[:1]
    except Exception as e: pg_num, _ = -1, print(f'\n{e!r} from {pg_url}')
    else: print('', end=f'\rScraped {len(dfList[-1])} rows from {pg_url}')

# pd.concat(dfList).to_csv(output_fp, index=False) ## save without page numbers
df=pd.concat(dfList, keys=list(range(1,len(dfList)+1)),names=['from_pg','pgi'])
df.reset_index().drop('pgi',axis='columns').to_csv(output_fp, index=False)

As you can see from the output, the links are not scraped. However, it should be noted that from pandas 1.5.0 you can set the extract_links parameter of read_html; every header and cell then comes back as a (text, link) tuple, which can be cleaned up with something like:

# df = pd.concat(dfList.....
orig_cols = [c for c in df.columns if c != 'from_pg']
for ocn in orig_cols:
    if any(vals:=[cv for cv,*_ in df[ocn]]): df[ocn[0]] = vals
    if any(links:=[c[1] for c in df[ocn]]): df[ocn[0].split()[0]+' Links'] = links
if 'Email Links' in df.columns:
    df['Email'] = df['Email Links'].str.replace('mailto:','',1)
    df = df.drop('Email Links', axis='columns')
df = df.drop(orig_cols, axis='columns')
# df.....to_csv(output_fp, index=False)
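
To see why the cleanup above indexes cv and c[1]: with extract_links='all' every column label and every cell is a (text, link) pair. Here is a minimal sketch to inspect that structure on the first page (an illustrative example that reuses base_url from the block above and needs pandas >= 1.5.0; the sample name is just for the demo):

sample = pd.read_html(
    f'{base_url}?_eu_europa_ec_jrc_dih_web_DihWebPortlet_cur=1',
    extract_links='all'
)[0]
print(sample.columns[0])  # a (header_text, header_link) tuple
print(sample.iloc[0, 0])  # a (cell_text, cell_link) tuple; the link may be None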


requests+bs4 Solution

The function below (view outputs for the first page) should extract all the data from a single table row (tr tag):

def get_row_dict(trTag):
    row = { td['data-ecl-table-header']: td.get_text(' ', strip=True) 
            for td in trTag.select('td[data-ecl-table-header]')} 
    for td in trTag.select('td[data-ecl-table-header]:has(a[href])'):
        k, link = td['data-ecl-table-header'], td.find('a',href=True)['href']
        if k=='Email' and link.startswith('mailto:'):
            link = link.replace('mailto:', '', 1)
        row[(k.split()[0]+' Link') if row[k] else k] = link
    return row 
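
If you want to sanity-check get_row_dict on its own first, here is a quick sketch against just the first page (it assumes requests and BeautifulSoup are already imported, as in the loop below, and uses the same row selector):

# hedged quick test: parse only the first page and print one row's dict
url = 'https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool'
soup = BeautifulSoup(requests.get(url).content, 'lxml')
first_row = soup.select_one('tr:has(td[data-ecl-table-header])')
if first_row: print(get_row_dict(first_row))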

My preferred approach when scraping paginated data is to use a while loop conditional on there being a next link.

# import requests
# import pandas as pd
# from bs4 import BeautifulSoup
# def get_row_dict...
output_fp = 'digital-innovation-hubs.csv'

all_rows, pg_num, max_pg = [], 0, None
next_link = 'https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool'
while next_link:
    pg_num += 1
    if isinstance(max_pg, int) and pg_num>max_pg: break
    pgSoup = BeautifulSoup((pgReq:=requests.get(next_link)).content, 'lxml')
    rows = pgSoup.select('tr:has(td[data-ecl-table-header])')
    all_rows += [{'from_pg': pg_num, **get_row_dict(r)} for r in rows]
    # all_rows += [get_row_dict(r) for r in rows] # no "from_pg" column

    ## just for printing ##
    pgNum = pgSoup.find('span', {'aria-current':"true", 'aria-label':True})
    if pgNum: pgNum = ['',*pgNum.get_text(' ', strip=True).split()][-1]
    from_pg=int(pgNum) if isinstance(pgNum,str) and pgNum.isdigit() else pg_num
    rowCt = pgSoup.find('div', class_='ecl-u-type-prolonged-s')
    rowCt = rowCt.text.split(':')[-1].strip() if rowCt else 'UNKNOWN'  
    vStr = f'{len(rows)} scraped [total: {len(all_rows)} of {rowCt}] - '
    vStr += f'<{pgReq.status_code} {pgReq.reason}> from {pgReq.url}'
    print(f'\r[{pg_num}][{pgNum}] {vStr}', end='')

    next_link = pgSoup.find('a', {'href':True, 'aria-label':'Go to next page'})
    if next_link: next_link = next_link['href']

pd.DataFrame(all_rows).to_csv(output_fp, index=False)

[screenshot: output DataFrame]


ADDED EDIT: Scraping Hub Links

import requests
import pandas as pd
from bs4 import BeautifulSoup

def get_hub_section(h2_tag):
    sibs = h2_tag.find_next_siblings(True) ## section contents
    hTxt = ' '.join(h2_tag.get_text(' ').split()) ## header text - minimize space
    if not sibs: return hTxt, None

    if len(sibs)==1 and sibs[0].name in ['ul','ol']: ## list
        l_items = [li.get_text(' ',strip=True) for li in sibs[0].select('li')]
        return hTxt, l_items
    if not (len(sibs)==1 and sibs[0].name=='dl'): ## paragraphs [default]
        p_list = [p.get_text(' ') for p in sibs if p.text.strip()]
        return hTxt, '\n'.join([' '.join(p.split()) for p in p_list])
    
    sub_sections = {} ## section will be further parsed to a dictionary 
    for d in sibs[0].select('dt:has(+dd)'):
        k = ' '.join(d.get_text(' ').split()) ## subheader - minimize whitespace
        ss_text = d.find_next('dd').get_text(' ',strip=True)
        links = [a['href'] for a in d.find_next('dd').find_all('a', href=True)]
        if len(links)==1: links = links[0]

        if links: sub_sections[f'{k} Links'] = links
        if ss_text and ss_text!=links: sub_sections[k] = ss_text
    return hTxt, sub_sections

def get_hub_info(hub_link, pre_print='', post_print=''):
    hSoup = BeautifulSoup((hReq := requests.get(hub_link)).content, 'lxml') 
    hub_info = [get_hub_section(h2) for h2 in hSoup.find_all('h2', {'id':True})]

    rStatus = f'<{hReq.status_code} {hReq.reason}> from {hReq.url}'
    print(f'\r{pre_print}{len(hub_info)} sections {rStatus}', end=post_print)
    return {'Hub Link': hub_link, **{k: v for k, v in hub_info}}

output_fp = 'digital-innovation-hubs.csv' ## from previous output
hubs_opfp = 'DIH-hubs-detailed.csv' ## can be the same as output_fp to replace it

### GET LIST OF LINKS ###
# hub_links = set(df['Hub Link']) ## if you're scraping for both outputs at once
hub_links = set(pd.read_csv(output_fp)['Hub Link']) ## if you already have output_fp

## SCRAPE AND SAVE ##
llen = len(hub_links)
data = [get_hub_info(l, f'[{i} of {llen}] ') for i, l in enumerate(hub_links,1)]
pd.json_normalize(data, sep=' → ').to_csv(hubs_opfp, index=False)
# pd.DataFrame(data).to_csv(hubs_opfp, index=False) ## doesn't expand subheaders
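
To sanity-check get_hub_info against a single hub first, the detail URL quoted in the question can be passed in directly (a minimal sketch; it assumes only the imports and functions defined above, and the one_hub name is just illustrative):

one_hub = get_hub_info(
    'https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/17265/view'
    '?_eu_europa_ec_jrc_dih_web_DihWebPortlet_backUrl=%2Fdigital-innovation-hubs-tool'
)
print(list(one_hub.keys()))  # 'Hub Link' plus one key per h2 section on the page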

[screenshot: normalized output DataFrame]


[ view results spreadsheet ]

Answered By: Driftr95