Extracting data from HTML table


A Python solution using BeautifulSoup4. It skips the header row when reading the data and uses class="details" to select the table:

    from bs4 import BeautifulSoup

    html = """
      <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
        <tr valign="top">
          <th>Tests</th>
          <th>Failures</th>
          <th>Success Rate</th>
          <th>Average Time</th>
          <th>Min Time</th>
          <th>Max Time</th>
        </tr>
        <tr valign="top" class="Failure">
          <td>103</td>
          <td>24</td>
          <td>76.70%</td>
          <td>71 ms</td>
          <td>0 ms</td>
          <td>829 ms</td>
        </tr>
      </table>"""

    soup = BeautifulSoup(html)
    table = soup.find("table", attrs={"class":"details"})

    # The first tr contains the field names.
    headings = [th.get_text() for th in table.find("tr").find_all("th")]

    datasets = []
    for row in table.find_all("tr")[1:]:
        dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
        datasets.append(dataset)

    print datasets

The result looks like this:

    [[(u'Tests', u'103'),
      (u'Failures', u'24'),
      (u'Success Rate', u'76.70%'),
      (u'Average Time', u'71 ms'),
      (u'Min Time', u'0 ms'),
      (u'Max Time', u'829 ms')]]

To produce the desired output, use something like this:

    for dataset in datasets:
        for field in dataset:
            print "{0:<16}: {1}".format(field[0], field[1])

Result:

    Tests           : 103
    Failures        : 24
    Success Rate    : 76.70%
    Average Time    : 71 ms
    Min Time        : 0 ms
    Max Time        : 829 ms
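The snippets above use Python 2 print statements. On Python 3 the same formatting step could be sketched roughly as follows. This is only an illustration: the helper name `format_row` and its `width` parameter are made up here, and the rows are hard-coded to the values from the result shown above so the block runs on its own.

```python
# Rows as (heading, value) pairs, hard-coded to match the result shown above.
datasets = [[
    ("Tests", "103"),
    ("Failures", "24"),
    ("Success Rate", "76.70%"),
    ("Average Time", "71 ms"),
    ("Min Time", "0 ms"),
    ("Max Time", "829 ms"),
]]


def format_row(dataset, width=16):
    """Return one 'heading : value' line per field, headings left-aligned.

    format_row and width are hypothetical names used for this sketch only.
    """
    return ["{0:<{1}}: {2}".format(name, width, value) for name, value in dataset]


for dataset in datasets:
    for line in format_row(dataset):
        print(line)
```

Note that on Python 3 `zip()` returns a one-shot iterator, so a row built with `zip(...)` can only be looped over once unless it is wrapped in `list()` or `dict()` first.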


Use pandas.read_html:

    import pandas as pd
    html_tables = pd.read_html('resources/test.html')
    df = html_tables[0]
    df.T # transpose to align

                       0
    Tests            103
    Failures          24
    Success Rate  76.70%
    Average Time   71 ms


Here is the top answer, adapted for Python 3 and improved by stripping whitespace from the cell text:

    from bs4 import BeautifulSoup

    html = """
      <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
        <tr valign="top">
          <th>Tests</th>
          <th>Failures</th>
          <th>Success Rate</th>
          <th>Average Time</th>
          <th>Min Time</th>
          <th>Max Time</th>
        </tr>
        <tr valign="top" class="Failure">
          <td>103</td>
          <td>24</td>
          <td>76.70%</td>
          <td>71 ms</td>
          <td>0 ms</td>
          <td>829 ms</td>
        </tr>
      </table>"""

    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find("table")

    # The first tr contains the field names.
    headings = [th.get_text().strip() for th in table.find("tr").find_all("th")]
    print(headings)

    datasets = []
    for row in table.find_all("tr")[1:]:
        # Strip each cell and pair it with its heading.
        dataset = dict(zip(headings, (td.get_text().strip() for td in row.find_all("td"))))
        datasets.append(dataset)

    print(datasets)
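If installing BeautifulSoup or pandas is not an option, roughly the same extraction can be sketched with the standard library's `html.parser` module. This is a different technique from the answers above, not a drop-in replacement: the `TableParser` class is a hypothetical helper written for this example, it assumes a single simple table with no nested tables, and it is far less forgiving of malformed markup than BeautifulSoup.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row.

    Hypothetical helper for this sketch; assumes one flat table.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


html = """
  <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
      <th>Tests</th>
      <th>Failures</th>
      <th>Success Rate</th>
      <th>Average Time</th>
      <th>Min Time</th>
      <th>Max Time</th>
    </tr>
    <tr valign="top" class="Failure">
      <td>103</td>
      <td>24</td>
      <td>76.70%</td>
      <td>71 ms</td>
      <td>0 ms</td>
      <td>829 ms</td>
    </tr>
  </table>"""

parser = TableParser()
parser.feed(html)
parser.close()

# The first row holds the headings; every later row becomes one dict.
headings, *data_rows = parser.rows
datasets = [dict(zip(headings, row)) for row in data_rows]
print(datasets)
```

The result has the same list-of-dicts shape as the BeautifulSoup version, so any later formatting code should work with either.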