Extracting data from HTML table
A Python solution using BeautifulSoup4 (Edit: with proper skipping of the header row. Edit3: using class="details" to select the table):
    from bs4 import BeautifulSoup

    html = """
      <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
        <tr valign="top">
          <th>Tests</th>
          <th>Failures</th>
          <th>Success Rate</th>
          <th>Average Time</th>
          <th>Min Time</th>
          <th>Max Time</th>
        </tr>
        <tr valign="top" class="Failure">
          <td>103</td>
          <td>24</td>
          <td>76.70%</td>
          <td>71 ms</td>
          <td>0 ms</td>
          <td>829 ms</td>
        </tr>
      </table>"""

    soup = BeautifulSoup(html)
    table = soup.find("table", attrs={"class":"details"})

    # The first tr contains the field names.
    headings = [th.get_text() for th in table.find("tr").find_all("th")]

    datasets = []
    for row in table.find_all("tr")[1:]:
        dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
        datasets.append(dataset)

    print datasets
The result looks like this:
[[(u'Tests', u'103'), (u'Failures', u'24'), (u'Success Rate', u'76.70%'), (u'Average Time', u'71 ms'), (u'Min Time', u'0 ms'), (u'Max Time', u'829 ms')]]
Edit2: To produce the desired output, use something like this:
    for dataset in datasets:
        for field in dataset:
            print "{0:<16}: {1}".format(field[0], field[1])
Result:
    Tests           : 103
    Failures        : 24
    Success Rate    : 76.70%
    Average Time    : 71 ms
    Min Time        : 0 ms
    Max Time        : 829 ms
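On Python 3, the same aligned output can be produced with an f-string. This is only a sketch: it assumes `datasets` has the shape of the result shown above (a list of rows, each a list of (heading, value) pairs), and the helper name `format_rows` is made up for illustration.

```python
def format_rows(datasets, width=16):
    """Return one 'heading : value' line per field, headings left-aligned to `width`."""
    lines = []
    for dataset in datasets:
        for heading, value in dataset:
            # `<` left-aligns the heading and pads it with spaces up to `width`.
            lines.append(f"{heading:<{width}}: {value}")
    return lines


# Sample data in the same shape as the result shown above.
datasets = [[("Tests", "103"), ("Failures", "24"), ("Success Rate", "76.70%")]]

for line in format_rows(datasets):
    print(line)
```

Returning the lines instead of printing them directly makes it easy to write them to a file or join them into one string.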
Use pandas.read_html:
    import pandas as pd

    html_tables = pd.read_html('resources/test.html')
    df = html_tables[0]
    df.T  # transpose to align

                       0
    Tests            103
    Failures          24
    Success Rate  76.70%
    Average Time   71 ms
Here is the top answer, adapted for Python 3 compatibility, and improved by stripping whitespace in cells:

    from bs4 import BeautifulSoup

    html = """
      <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
        <tr valign="top">
          <th>Tests</th>
          <th>Failures</th>
          <th>Success Rate</th>
          <th>Average Time</th>
          <th>Min Time</th>
          <th>Max Time</th>
        </tr>
        <tr valign="top" class="Failure">
          <td>103</td>
          <td>24</td>
          <td>76.70%</td>
          <td>71 ms</td>
          <td>0 ms</td>
          <td>829 ms</td>
        </tr>
      </table>"""

    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find("table")

    # The first tr contains the field names.
    headings = [th.get_text().strip() for th in table.find("tr").find_all("th")]
    print(headings)

    datasets = []
    for row in table.find_all("tr")[1:]:
        # Strip surrounding whitespace from each cell before pairing it with its heading.
        dataset = dict(zip(headings, (td.get_text().strip() for td in row.find_all("td"))))
        datasets.append(dataset)

    print(datasets)
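If installing BeautifulSoup is not an option, roughly the same heading/row pairing can be done with the standard library's `html.parser`. This is a different technique from the answers above, offered as a minimal sketch: the names `TableParser` and `table_to_dicts` are made up for illustration, and it assumes the input contains a single, non-nested table whose first row holds the headings.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # Join the fragments and strip surrounding whitespace.
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_dicts(html):
    """Pair the first row (headings) with each following row."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    headings, *body = parser.rows
    return [dict(zip(headings, row)) for row in body]


html = """
<table class="details">
  <tr><th>Tests</th><th>Failures</th><th>Success Rate</th></tr>
  <tr class="Failure"><td> 103 </td><td>24</td><td>76.70%</td></tr>
</table>
"""

print(table_to_dicts(html))
```

Unlike BeautifulSoup, this sketch does not select a table by class and makes no attempt to repair badly malformed markup, so it is best suited to small, well-formed inputs like the one in the question.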