pandas read_html function with colspan=2
Question:
I’m using the pandas read_html function to load an HTML table into a DataFrame. It fails because the source table has a merged header cell (colspan=2), which raises this AssertionError: 6 columns passed, passed data had 7 columns.
I’ve tried various options with the header kwarg (header=None, header=[‘Code’…]) but nothing seems to work.
Does anyone know of a way to parse an HTML table with merged columns using pandas read_html?
Answers:
If you don’t insist on using read_html from pandas, this code does the job:
import pandas as pd
from lxml.html import parse
from urllib.request import urlopen  # Python 3; on Python 2 this was urllib2.urlopen
from pandas.io.parsers import TextParser

def _unpack(row, kind='td'):
    # Return the text of every <td> (or <th>) cell in a table row.
    elts = row.findall('.//%s' % kind)
    return [val.text_content() for val in elts]

def parse_options_data(table):
    # The first row supplies the column names; the remaining rows are data.
    rows = table.findall('.//tr')
    header = _unpack(rows[0], kind='th')
    data = [_unpack(r) for r in rows[1:]]
    return TextParser(data, names=header).get_chunk()

parsed = parse(urlopen('http://www.bmfbovespa.com.br/en-us/intros/Limits-and-Haircuts-for-accepting-stocks-as-collateral.aspx?idioma=en-us'))
doc = parsed.getroot()
tables = doc.findall('.//table')
table = parse_options_data(tables[0])
This is taken from the book “Python for Data Analysis” by Wes McKinney.
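Note that `_unpack` returns one value per cell element, so a header cell with colspan=2 still counts as a single name, which is the kind of mismatch the question describes. One way around it is to repeat each cell's text once per spanned column before building the frame. The following is only a minimal sketch using the standard library's html.parser; the class name, the `expand_colspans` helper, and the sample table are hypothetical, and it ignores rowspan and nested tables.

```python
from html.parser import HTMLParser


class SpanAwareTableParser(HTMLParser):
    """Collect table rows, repeating each cell's text `colspan` times."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built
        self._text = None   # text fragments of the currently open cell
        self._span = 1      # colspan of the currently open cell

    def _flush_cell(self):
        # Close the open cell, if any, repeating its text across the span.
        if self._text is not None and self._row is not None:
            cell = "".join(self._text).strip()
            self._row.extend([cell] * self._span)
        self._text = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._flush_cell()  # tolerate a missing </td> or </th>
            self._text = []
            # A missing or malformed colspan falls back to 1.
            raw = dict(attrs).get("colspan") or "1"
            self._span = max(int(raw), 1) if raw.isdigit() else 1

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag == "tr" and self._row is not None:
            self._flush_cell()
            self.rows.append(self._row)
            self._row = None


def expand_colspans(html):
    """Return all rows of the HTML as lists with colspans expanded."""
    parser = SpanAwareTableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample: one merged header cell above two data columns.
SAMPLE = """
<table>
  <tr><th>Code</th><th colspan="2">Limit</th></tr>
  <tr><td>ABCD3</td><td>10</td><td>20</td></tr>
</table>
"""

header, *data = expand_colspans(SAMPLE)
print(header)  # ['Code', 'Limit', 'Limit']
print(data)    # [['ABCD3', '10', '20']]
```

With header and data rows the same width, the result can be handed to TextParser as in the code above, or to pd.DataFrame(data, columns=header). The repeated header names would likely need to be made unique first.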
pandas >= 0.24.0 understands colspan and rowspan attributes. As per the release notes:
result = pd.read_html("""
<table>
<thead>
<tr>
<th>A</th><th>B</th><th>C</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="2">1</td><td>2</td>
</tr>
</tbody>
</table>""")
result
Out:
[   A  B  C
 0  1  1  2]
Previously this would return the following:
[   A  B   C
 0  1  2 NaN]
I can’t test with your link because the URL is not found.