How to convert Wikipedia wikitable to Python Pandas DataFrame?


Here's a solution using py-wikimarkup and PyQuery to extract all tables as pandas DataFrames from a wikimarkup string, ignoring non-table content.

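import wikimarkup
import pandas as pd
from pyquery import PyQuery

def get_tables(wiki):
    # Render the wikimarkup to HTML, then query the resulting document.
    html = PyQuery(wikimarkup.parse(wiki))
    frames = []
    for table in html('table'):
        # One list of stripped cell texts per table row.
        data = [[x.text.strip() for x in row]
                for row in table.getchildren()]
        # The first row holds the headers; the remaining rows are the data.
        df = pd.DataFrame(data[1:], columns=data[0])
        frames.append(df)
    return frames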

Given the following input,

wiki = """
=Title=

Description.

{| class="wikitable sortable"
|-
! Model !! Mhash/s !! Mhash/J !! Watts !! Clock !! SP !! Comment
|-
| ION || 1.8 || 0.067 || 27 ||  || 16 || poclbm;  power consumption incl. CPU
|-
| 8200 mGPU || 1.2 || || || 1200 || 16 || 128 MB shared memory, "poclbm -w 128 -f 0"
|-
| 8400 GS || 2.3 || || || || || "poclbm -w 128"
|-
|}

{| class="wikitable sortable"
|-
! A !! B !! C
|-
| 0
| 1
| 2
|-
| 3
| 4
| 5
|}
"""

get_tables returns the following DataFrames.

       Model Mhash/s Mhash/J Watts Clock  SP                                     Comment
0        ION     1.8   0.067    27        16        poclbm;  power consumption incl. CPU
1  8200 mGPU     1.2                1200  16  128 MB shared memory, "poclbm -w 128 -f 0"
2    8400 GS     2.3                                                     "poclbm -w 128"

 

   A  B  C
0  0  1  2
1  3  4  5
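Note that get_tables builds every cell from stripped text, so the frames hold strings, and blank cells are empty strings. If numbers are wanted, something like the following might work as a follow-up step. This is only a sketch: coerce_numeric is a hypothetical helper, not part of pandas or py-wikimarkup, and the small raw frame merely stands in for one of the frames above.

```python
import pandas as pd

def coerce_numeric(df, columns):
    """Return a copy of df with the given columns converted to numbers.

    Blank or non-numeric cells become NaN instead of raising an error.
    """
    out = df.copy()
    for col in columns:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out

# Stand-in for a frame returned by get_tables: every cell is a string.
raw = pd.DataFrame(
    [["ION", "1.8", "27"], ["8200 mGPU", "1.2", ""]],
    columns=["Model", "Mhash/s", "Watts"],
)
clean = coerce_numeric(raw, ["Mhash/s", "Watts"])
```

Text columns such as Model and Comment are left alone; only the columns named in the call are converted.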


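You can use pandas directly: read_html fetches the page at url, parses its HTML, and returns a list of DataFrames, one per table whose attributes match attrs. It needs an HTML parser library (lxml, or BeautifulSoup with html5lib) to be installed. Something like this: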

pandas.read_html(url, attrs={"class": "wikitable"})
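To try this without a network connection, the same call can be pointed at an HTML string wrapped in StringIO. The sketch below assumes a made-up HTML snippet and a hypothetical wrapper named wikitables_from_html; only read_html and its attrs parameter come from pandas itself.

```python
from io import StringIO

import pandas as pd

# Made-up HTML standing in for a rendered Wikipedia page.
HTML = """
<table class="wikitable">
  <tr><th>A</th><th>B</th><th>C</th></tr>
  <tr><td>0</td><td>1</td><td>2</td></tr>
  <tr><td>3</td><td>4</td><td>5</td></tr>
</table>
<table class="navbox">
  <tr><td>ignored</td></tr>
</table>
"""

def wikitables_from_html(html):
    """Return each <table class="wikitable"> in html as a DataFrame."""
    # read_html always returns a list, even when only one table matches.
    return pd.read_html(StringIO(html), attrs={"class": "wikitable"})
```

Calling wikitables_from_html(HTML) should give a one-element list, since the navbox table does not match attrs. Passing a URL instead of a StringIO object makes pandas download the page first.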


Use re to preprocess the markup, then use read_csv to convert it to a DataFrame:

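import re
from io import StringIO

import pandas as pd

table = """{| class="wikitable sortable"
|-
! Model !! Mhash/s !! Mhash/J !! Watts !! Clock !! SP !! Comment
|-
| ION || 1.8 || 0.067 || 27 ||  || 16 || poclbm;  power consumption incl. CPU
|-
| 8200 mGPU || 1.2 || || || 1200 || 16 || 128 MB shared memory, "poclbm -w 128 -f 0"
|-
| 8400 GS || 2.3 || || ||  ||  || "poclbm -w 128"
|-
|}"""

# Drop the "|-" row separators, then strip the leading "| " or "! " from each line.
data = StringIO(re.sub(r"^\|.|^!.", "", table.replace("|-\n", ""), flags=re.MULTILINE))
# A regex delimiter needs the python engine; skiprows=1 skips the "{| class=..." line.
df = pd.read_csv(data, delimiter=r"\|\||!!", skiprows=1, engine="python")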

output:

       Model    Mhash/s   Mhash/J   Watts   Clock    SP                                       Comment
0        ION         1.8    0.067      27            16          poclbm;  power consumption incl. CPU
1  8200 mGPU         1.2                     1200    16    128 MB shared memory, "poclbm -w 128 -f 0"
2    8400 GS         2.3                                                              "poclbm -w 128"
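The regex approach relies on every row sitting on one line with || between cells, so it would not cope with the one-cell-per-line layout used by the second table in the first answer. A line-by-line parser can accept both layouts. The following is a rough sketch under several assumptions: wikitable_to_frame is a hypothetical helper, the first row is taken as the header, and cell attributes, nested tables, and rowspan/colspan are not handled.

```python
import pandas as pd

def wikitable_to_frame(markup):
    """Parse one simple wikitable into a DataFrame of strings."""
    rows, current = [], []
    for line in markup.splitlines():
        line = line.strip()
        if line.startswith("{|") or line.startswith("|+"):
            continue  # table opening or caption: nothing to collect
        if line.startswith("|-") or line.startswith("|}"):
            # Row separator or table end: close the row collected so far.
            if current:
                rows.append(current)
            current = []
        elif line.startswith("!"):
            # Header cells, several per line separated by "!!".
            current += [c.strip() for c in line[1:].split("!!")]
        elif line.startswith("|"):
            # Data cells, one per line or several separated by "||".
            current += [c.strip() for c in line[1:].split("||")]
    if current:
        rows.append(current)  # table without a closing "|}"
    return pd.DataFrame(rows[1:], columns=rows[0])

# Mixed layout: one row with a cell per line, one row using "||".
sample = """{| class="wikitable sortable"
|-
! A !! B !! C
|-
| 0
| 1
| 2
|-
| 3 || 4 || 5
|}"""
df = wikitable_to_frame(sample)
```

As with the other approaches, every cell comes back as a string, and empty cells become empty strings.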