Extracting data from HTML table
Question:
I am looking for a way to extract certain information from HTML in a Linux shell environment.
This is the bit that I’m interested in:
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>
I want to store these values in shell variables, or echo them as key-value pairs extracted from the HTML above. Example:
Tests : 103
Failures : 24
Success Rate : 76.70 %
and so on.
What I can do at the moment is write a Java program that uses a SAX parser or an HTML parser such as jsoup to extract this info.
But using Java here seems like overhead, since the runnable jar has to be included in the “wrapper” script you want to execute.
I’m sure there must be “shell” languages out there that can do the same, e.g. Perl, Python, bash, etc.
My problem is that I have zero experience with these. Can somebody help me resolve this “fairly easy” issue?
Quick update:
I forgot to mention that I’ve got more tables and more rows in the .html document. Sorry about that (early morning).
Update #2:
I tried to install BeautifulSoup like this, since I don’t have root access:
$ wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz
$ tar -zxvf beautifulsoup4-4.1.0.tar.gz
$ cp -r beautifulsoup4-4.1.0/bs4 .
$ vi htmlParse.py # paste the code from Tichodromas' answer; just in case, this (http://pastebin.com/4Je11Y9q) is what I pasted
Then I ran the file and got this error:
$ python htmlParse.py
Traceback (most recent call last):
File "htmlParse.py", line 1, in ?
from bs4 import BeautifulSoup
File "/home/gdd/setup/py/bs4/__init__.py", line 29
from .builder import builder_registry
^
SyntaxError: invalid syntax
Update #3 :
Running Tichodromas’ answer gives this error:
Traceback (most recent call last):
File "test.py", line 27, in ?
headings = [th.get_text() for th in table.find("tr").find_all("th")]
TypeError: 'NoneType' object is not callable
Any ideas?
Answers:
undef $/;
$text = <DATA>;
@tabs = $text =~ m!<table.*?>(.*?)</table>!gms;
for (@tabs) {
    @th = m!<th>(.*?)</th>!gms;
    @td = m!<td>(.*?)</td>!gms;
}
for $i (0..$#th) {
    printf "%-16s\t: %s\n", $th[$i], $td[$i];
}
__DATA__
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>
The output is as follows:
Tests : 103
Failures : 24
Success Rate : 76.70%
Average Time : 71 ms
Min Time : 0 ms
Max Time : 829 ms
A Python solution using BeautifulSoup4 (Edit: with proper skipping. Edit3: using class="details" to select the table):
from bs4 import BeautifulSoup
html = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>"""
soup = BeautifulSoup(html)
table = soup.find("table", attrs={"class":"details"})
# The first tr contains the field names.
headings = [th.get_text() for th in table.find("tr").find_all("th")]
datasets = []
for row in table.find_all("tr")[1:]:
    dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
    datasets.append(dataset)
print datasets
The result looks like this:
[[(u'Tests', u'103'),
(u'Failures', u'24'),
(u'Success Rate', u'76.70%'),
(u'Average Time', u'71 ms'),
(u'Min Time', u'0 ms'),
(u'Max Time', u'829 ms')]]
Edit2: To produce the desired output, use something like this:
for dataset in datasets:
    for field in dataset:
        print "{0:<16}: {1}".format(field[0], field[1])
Result:
Tests : 103
Failures : 24
Success Rate : 76.70%
Average Time : 71 ms
Min Time : 0 ms
Max Time : 829 ms
A Python solution that uses only the standard library (takes advantage of the fact that the HTML happens to be well-formed XML). More than one row of data can be handled.
(Tested with Python 2.6 and 2.7. The question was updated to say that the OP uses Python 2.4, so this answer may not be very useful in that case: ElementTree was added in Python 2.5.)
from xml.etree.ElementTree import fromstring
HTML = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
<tr valign="top" class="whatever">
<td>A</td>
<td>B</td>
<td>C</td>
<td>D</td>
<td>E</td>
<td>F</td>
</tr>
</table>"""
tree = fromstring(HTML)
rows = tree.findall("tr")
headrow = rows[0]
datarows = rows[1:]
for num, h in enumerate(headrow):
    data = ", ".join([row[num].text for row in datarows])
    print "{0:<16}: {1}".format(h.text, data)
Output:
Tests : 103, A
Failures : 24, B
Success Rate : 76.70%, C
Average Time : 71 ms, D
Min Time : 0 ms, E
Max Time : 829 ms, F
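For readers on Python 3, the same ElementTree idea can be sketched so that each data row becomes its own dictionary instead of being joined per column. This is only an illustration: rows_as_dicts is a hypothetical helper name, the sample table is shortened, and the approach still assumes the table markup happens to be well-formed XML.

```python
from xml.etree.ElementTree import fromstring

HTML = """
<table class="details">
<tr><th>Tests</th><th>Failures</th><th>Success Rate</th></tr>
<tr class="Failure"><td>103</td><td>24</td><td>76.70%</td></tr>
<tr class="whatever"><td>A</td><td>B</td><td>C</td></tr>
</table>"""


def rows_as_dicts(table_xml):
    """Return one {heading: value} dict per data row (hypothetical helper)."""
    table = fromstring(table_xml)
    rows = table.findall("tr")
    # The first row holds the <th> headings; the remaining rows hold <td> cells.
    headings = [(th.text or "").strip() for th in rows[0]]
    return [
        dict(zip(headings, ((td.text or "").strip() for td in row)))
        for row in rows[1:]
    ]


if __name__ == "__main__":
    for record in rows_as_dicts(HTML):
        for key, value in record.items():
            print("{0:<16}: {1}".format(key, value))
        print()
```

Keeping one dictionary per row makes it easier to pick out a single value by its heading later on.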
Assuming your html code is stored in a mycode.html file, here is a bash way:
paste -d: <(grep '<th>' mycode.html | sed -e 's,</*th>,,g') <(grep '<td>' mycode.html | sed -e 's,</*td>,,g')
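Note: the output is not perfectly aligned, and because headers and cells are paired line by line, this assumes a single header row and a single data row with each cell on its own line.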
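Since the question mentions several tables and rows, a line-oriented grep/sed pipeline can pair cells with the wrong headers. As a hedged alternative, here is a minimal sketch that uses only html.parser from the Python 3 standard library. TableParser and parse_tables are hypothetical names chosen for this example, the sample markup is shortened, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every table as a list of rows, each row a list of cell strings.

    Hypothetical helper for illustration; nested tables are not handled.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen so far
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None  # drop a cell left open by sloppy markup


def parse_tables(html):
    """Return all tables found in the given HTML string."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables


if __name__ == "__main__":
    sample = """
    <table class="details">
    <tr><th>Tests</th><th>Failures</th></tr>
    <tr><td>103</td><td>24</td></tr>
    </table>"""
    for table in parse_tables(sample):
        headings, data_rows = table[0], table[1:]
        for row in data_rows:
            for key, value in zip(headings, row):
                print("{0:<16}: {1}".format(key, value))
```

Unlike the ElementTree approach, this does not require the document to be well-formed XML, which is often the case for generated report pages.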
Here is the top answer, adapted for Python3 compatibility, and improved by stripping whitespace in cells:
from bs4 import BeautifulSoup
html = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>"""
soup = BeautifulSoup(html, 'html.parser')
table = soup.find("table")
# The first tr contains the field names.
headings = [th.get_text().strip() for th in table.find("tr").find_all("th")]
print(headings)
datasets = []
for row in table.find_all("tr")[1:]:
    dataset = dict(zip(headings, (td.get_text().strip() for td in row.find_all("td"))))
    datasets.append(dataset)
print(datasets)
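The question also asks about storing the values in shell variables. One possible sketch, assuming a dictionary like the ones collected in datasets: turn each heading into a valid variable name and print quoted assignments that the calling shell script can evaluate. to_shell_assignments is a hypothetical helper, and the upper-case/underscore naming scheme is an assumption made for this example.

```python
import re
import shlex


def to_shell_assignments(record):
    """Turn {'Average Time': '71 ms'} into ["AVERAGE_TIME='71 ms'"].

    Hypothetical helper: the upper-case/underscore naming is an assumption.
    """
    lines = []
    for key, value in record.items():
        # Keep only characters that are legal in a shell variable name.
        name = re.sub(r"[^A-Za-z0-9_]+", "_", key.strip()).strip("_").upper()
        if not name or name[0].isdigit():
            name = "_" + name
        # shlex.quote adds quotes only when the value needs them.
        lines.append("{0}={1}".format(name, shlex.quote(value)))
    return lines


if __name__ == "__main__":
    dataset = {"Tests": "103", "Failures": "24", "Average Time": "71 ms"}
    print("\n".join(to_shell_assignments(dataset)))
```

If this were saved as a script (say extract.py, a made-up name), a wrapper could run eval "$(python3 extract.py)" and then read $TESTS or $AVERAGE_TIME.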
Below is a Python regex-based solution that I have tested on Python 2.7. It doesn’t rely on the xml module, so it will work even if the markup is not fully well-formed XML.
import re
# input args: html string
# output: tables as a list, column max length
def extract_html_tables(html):
    tables=[]
    maxlen=0
    rex1=r'<table.*?/table>'
    rex2=r'<tr.*?/tr>'
    rex3=r'<(td|th).*?/(td|th)>'
    s = re.search(rex1,html,re.DOTALL)
    while s:
        t = s.group() # the table
        s2 = re.search(rex2,t,re.DOTALL)
        table = []
        while s2:
            r = s2.group() # the row
            s3 = re.search(rex3,r,re.DOTALL)
            row=[]
            while s3:
                d = s3.group() # the cell
                #row.append(strip_tags(d).strip() )
                row.append(d.strip() )
                r = re.sub(rex3,'',r,1,re.DOTALL)
                s3 = re.search(rex3,r,re.DOTALL)
            table.append( row )
            if maxlen<len(row):
                maxlen = len(row)
            t = re.sub(rex2,'',t,1,re.DOTALL)
            s2 = re.search(rex2,t,re.DOTALL)
        html = re.sub(rex1,'',html,1,re.DOTALL)
        tables.append(table)
        s = re.search(rex1,html,re.DOTALL)
    return tables, maxlen
html = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>"""
print extract_html_tables(html)
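As written, each cell in the result still carries its surrounding tags (for example <td>103</td>), because the strip_tags call is commented out and its definition is not shown. A possible stand-in, written for Python 3 and offered only as a rough sketch, removes anything that looks like a tag and decodes entities. The implementation below is an assumption, not the answerer's original helper.

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(fragment):
    """Remove anything that looks like a tag and decode entities.

    Hypothetical stand-in for the commented-out strip_tags call; a regex
    like this is only a rough approximation of real HTML parsing.
    """
    return unescape(TAG_RE.sub("", fragment)).strip()


if __name__ == "__main__":
    cells = ["<th>Tests</th>", '<td class="x">103</td>', "<td>R&amp;D</td>"]
    print([strip_tags(cell) for cell in cells])
```

With such a helper defined, the commented-out row.append(strip_tags(d).strip()) line could replace the plain row.append(d.strip()) call.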
Use pandas.read_html:
import pandas as pd
html_tables = pd.read_html('resources/test.html')
df = html_tables[0]
df.T # transpose to align
0
Tests 103
Failures 24
Success Rate 76.70%
Average Time 71 ms