pd.read_html(url) – awkward table design

Question:

Headings that appear part-way through the table are being collapsed into the column headings.

import pandas as pd

url = "https://www.environment.nsw.gov.au/topics/animals-and-plants/threatened-species/programs-legislation-and-framework/nsw-koala-strategy/local-government-resources-for-koala-conservation/north-coast-koala-management-area#:~:text=The%20North%20Coast%20Koala%20Management,Valley%2C%20Clarence%20Valley%20and%20Taree."
dfs = pd.read_html(url)  # returns a list with one DataFrame per <table> found
df = dfs[0]
df.head()

[Image: output of df.head()]

It would be great if I could have "High preferred use" as a column that is assigned to the correct species.
I tried reset_index(), but this did not work.
I have searched and can't find anything similar.

Response to @Master Oogway, and thanks to @DYZ for the edits.

There are multiple tables with class "table-striped":

[Screenshot: inspect element showing multiple class="table-striped"]

The suggested amendment removes the error, but it does not pick up the second table.
Take White Box, Eucalyptus albens. It occurs in the second table and not in the first.
If I export dftable and filter, there is no White Box:

[Screenshot: filter showing no White Box]

If I write htmltable to .txt when using find_all and then search, it's there:

[Screenshot: search of the .txt file]

I have never done this before and appreciate that this is annoying.
Thanks for the help so far.

It appears that find_all gathers all of the table data,
but the creation of dftable is limited to the first "table-striped" table.
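
The difference described above can be reproduced on a small, made-up page. This is only a sketch: the inline HTML, the helper name table_to_frame and the two one-row tables are assumptions for illustration, not the real page. It shows that find() returns just the first matching table, while find_all() returns every match, so each table has to be converted and the results concatenated.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up stand-in for a page holding two tables with the same class
html = """
<table class="table-striped">
  <tr><th>Common name</th><th>Species name</th></tr>
  <tr><td>Grey gum</td><td>Eucalyptus biturbinata</td></tr>
</table>
<table class="table-striped">
  <tr><th>Common name</th><th>Species name</th></tr>
  <tr><td>White box</td><td>Eucalyptus albens</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")


def table_to_frame(table):
    """Hypothetical helper: first row gives the column names, the rest are data."""
    rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            for tr in table.find_all("tr")]
    return pd.DataFrame(rows[1:], columns=rows[0])


# find() stops at the first match, so the second table is never seen
first_only = table_to_frame(soup.find("table", {"class": "table-striped"}))

# find_all() returns every match; convert each one and stack the results
all_tables = pd.concat(
    [table_to_frame(t) for t in soup.find_all("table", {"class": "table-striped"})],
    ignore_index=True,
)
```

Here first_only has no "White box" row, while all_tables does.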

Asked By: GWAE


Answers:

The table cannot be easily parsed with read_html because of its unorthodox use of the <thead> element. You can try your luck with BeautifulSoup:

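import urllib.request

import bs4
import pandas as pd

# `url` is the page address defined in the question
soup = bs4.BeautifulSoup(urllib.request.urlopen(url), "html.parser")
# One list of cell texts per table row, header (<th>) and data (<td>) cells alike
data = [["".join(cell.strings).strip()
         for cell in row.find_all(['td', 'th'])]
        for row in soup.find_all('table')[0].find_all('tr')]
# The first row supplies the column names; the outer parentheses
# let the chained calls span several lines
table = (pd.DataFrame(data[1:])
           .rename(columns=dict(enumerate(data[0])))
           .dropna(how='all'))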
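
The frame built above may still carry the category headings ("High preferred use" and so on) as rows of their own. A minimal sketch of turning them into a column follows. It assumes each heading row has text in the first cell only; the sample rows and the column name "Use" are made up for illustration.

```python
import pandas as pd

# Hypothetical sample shaped like the parsed rows: heading rows hold one cell
data = [
    ["Common name", "Species name"],
    ["High preferred use"],
    ["Grey gum", "Eucalyptus biturbinata"],
    ["Large-fruited grey gum", "Eucalyptus canaliculata"],
    ["Occasional use"],
    ["Broad-leaved paperbark", "Melaleuca quinquenervia"],
]

table = pd.DataFrame(data[1:], columns=data[0])

# Assumption: a heading row is one whose second cell is missing
is_heading = table["Species name"].isna()

# Keep the heading text only on heading rows, then fill it downwards
table["Use"] = table["Common name"].where(is_heading).ffill()

# Drop the heading rows themselves, leaving one row per species
tidy = table[~is_heading].reset_index(drop=True)
```

After this, tidy has one row per species, with the heading it appeared under in the "Use" column.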
Answered By: DYZ

I took a look at the link and the table you're trying to get.

The problem with the table in the link is that it contains multiple header rows, so pd.read_html(url) picks up all of them and uses them as your header:

[Image: table HTML head content]

So instead of using pandas to read the HTML, I used Beautiful Soup for what you're trying to accomplish.

With Beautiful Soup and urllib.request, I fetched the HTML from the URL and extracted the table by its class name:

url = "https://www.environment.nsw.gov.au/topics/animals-and-plants/threatened-species/programs-legislation-and-framework/nsw-koala-strategy/local-government-resources-for-koala-conservation/north-coast-koala-management-area#:~:text=The%20North%20Coast%20Koala%20Management,Valley%2C%20Clarence%20Valley%20and%20Taree."

#load html with urllib
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html.read(), 'lxml')


#get the table you're trying to get based
#on html elements
htmltable = soup.find('table', { 'class' : 'table-striped' })

Then I took a function I found for turning a Beautiful Soup table into a list and modified it. It now returns your values in a shape that is easy to load into a dataframe and easy to query, depending on what you want:

[{"common name" : value, "Species name": value, "type": value}…{}]
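
To make that shape concrete, here is a small sketch of how a list of dicts like this loads into pandas and can be queried by use type. The two records are hand-written examples in the target shape, not output from the real page.

```python
import pandas as pd

# Hypothetical records in the target shape described above
records = [
    {"Common name": "Grey gum",
     "Species name": "Eucalyptus biturbinata",
     "type": "High preferred use"},
    {"Common name": "Broad-leaved paperbark",
     "Species name": "Melaleuca quinquenervia",
     "type": "Occasional use"},
]

# Each dict becomes a row; the keys become the column names
df = pd.DataFrame(records)

# Select every species listed under one use type
high = df[df["type"] == "High preferred use"]

# Or collect the common names per use type in a plain dict
species_by_type = df.groupby("type")["Common name"].apply(list).to_dict()
```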

def tableDataText(table):
    """Parse a <table> whose rows are either headings (<th> cells)
    or data (<td> cells).

    The first row supplies the column names. Returns a list of dicts,
    one per data row, each tagged under 'type' with the text of the
    most recent heading row.
    """
    def rowgetDataText(tr, coltag='td'):  # td (data) or th (header)
        return [td.get_text(strip=True) for td in tr.find_all(coltag)]

    rows = []
    trs = table.find_all('tr')
    headerow = rowgetDataText(trs[0], 'th')
    if headerow:  # the first row holds the column names, so skip it below
        trs = trs[1:]
    last_head = [None]  # avoids a NameError if data comes before any heading

    for tr in trs:  # for every remaining table row
        # this part is modified: a row of <th> cells starts a new use type
        heads = rowgetDataText(tr, 'th')
        if heads:
            last_head = heads

        # a row of <td> cells is a species entry; store it as a dict with
        # "common name", "species name" and "type" (use type)
        row = rowgetDataText(tr, 'td')
        if row:
            rows.append({headerow[0]: row[0], headerow[1]: row[1], 'type': last_head[0]})

    return rows

Then, when we pass the table we extracted with Beautiful Soup to that function and convert the result to a dataframe, we get this:

[Image: dataframe of table from html]

You can then easily look up the type of use alongside each common name and species name.

Here is the full code:


import pandas as pd
from bs4 import BeautifulSoup
import urllib.request

url = "https://www.environment.nsw.gov.au/topics/animals-and-plants/threatened-species/programs-legislation-and-framework/nsw-koala-strategy/local-government-resources-for-koala-conservation/north-coast-koala-management-area#:~:text=The%20North%20Coast%20Koala%20Management,Valley%2C%20Clarence%20Valley%20and%20Taree."

#load html with urllib
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html.read(), 'lxml')


#get the table you're trying to get based
#on html elements
htmltable = soup.find('table', { 'class' : 'table-striped' })


#modified function taken from: https://stackoverflow.com/a/58274853/6297478
#to fit your data shape in a way that 
#you can use. 
def tableDataText(table):
    """Parse a <table> whose rows are either headings (<th> cells)
    or data (<td> cells).

    The first row supplies the column names. Returns a list of dicts,
    one per data row, each tagged under 'type' with the text of the
    most recent heading row.
    """
    def rowgetDataText(tr, coltag='td'):  # td (data) or th (header)
        return [td.get_text(strip=True) for td in tr.find_all(coltag)]

    rows = []
    trs = table.find_all('tr')
    headerow = rowgetDataText(trs[0], 'th')
    if headerow:  # the first row holds the column names, so skip it below
        trs = trs[1:]
    last_head = [None]  # avoids a NameError if data comes before any heading

    for tr in trs:  # for every remaining table row
        # this part is modified: a row of <th> cells starts a new use type
        heads = rowgetDataText(tr, 'th')
        if heads:
            last_head = heads

        # a row of <td> cells is a species entry; store it as a dict with
        # "common name", "species name" and "type" (use type)
        row = rowgetDataText(tr, 'td')
        if row:
            rows.append({headerow[0]: row[0], headerow[1]: row[1], 'type': last_head[0]})

    return rows

#we store our results from the function in list_table
list_table = tableDataText(htmltable)

#turn our table into a DataFrame
dftable = pd.DataFrame(list_table)
dftable

I left some comments for you in the code to help you out.

I hope this helps!

Answered By: Master Oogway

In addition to @DYZ's approach, here is one that uses CSS selectors, stripped_strings and find_previous(). It builds a list of dicts that is then turned into a dataframe:

from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://www.environment.nsw.gov.au/topics/animals-and-plants/threatened-species/programs-legislation-and-framework/nsw-koala-strategy/local-government-resources-for-koala-conservation/north-coast-koala-management-area#:~:text=The%20North%20Coast%20Koala%20Management,Valley%2C%20Clarence%20Valley%20and%20Taree."

data = []
soup = BeautifulSoup(requests.get(url).text)
for e in soup.select('table tbody tr'):
    data.append(
        dict(
            zip(
                soup.table.thead.stripped_strings,
                [e.find_previous('th').get_text(strip=True)]+list(e.stripped_strings)
            )
        )
    )

pd.DataFrame(data)
     Common name         Species name            High preferred use
0    High preferred use  Grey gum                Eucalyptus biturbinata
1    High preferred use  Large-fruited grey gum  Eucalyptus canaliculata
...
107  Occasional use      Broad-leaved paperbark  Melaleuca quinquenervia
108  Occasional use      nan                     nan
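
To see what find_previous() contributes, here is a stripped-down sketch on a made-up table. The inline HTML is an assumption for illustration and is simpler than the real page. For each data row, find_previous('th') walks backwards through the document and returns the nearest heading cell that comes before it.

```python
from bs4 import BeautifulSoup

# Made-up table: heading rows (<th>) interleaved with data rows (<td>)
html = """
<table>
  <tr><th>High preferred use</th></tr>
  <tr><td>Grey gum</td><td>Eucalyptus biturbinata</td></tr>
  <tr><th>Occasional use</th></tr>
  <tr><td>Broad-leaved paperbark</td><td>Melaleuca quinquenervia</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

pairs = []
for tr in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # only data rows have <td> cells
        # nearest <th> that appears earlier in the document than this row
        heading = tr.find_previous("th").get_text(strip=True)
        pairs.append((heading, *cells))
```

Each tuple in pairs starts with the heading that the species row sits under.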
Answered By: HedgeHog