How to convert Wikipedia wikitable to Python Pandas DataFrame?
Here's a solution using py-wikimarkup and PyQuery to extract all tables as pandas DataFrames from a wikimarkup string, ignoring non-table content.
import wikimarkup
import pandas as pd
from pyquery import PyQuery

def get_tables(wiki):
    # Render the wikimarkup to HTML, then walk every <table> element.
    html = PyQuery(wikimarkup.parse(wiki))
    frames = []
    for table in html('table'):
        # One list of cell texts per row; the first row is the header.
        data = [[x.text.strip() for x in row]
                for row in table.getchildren()]
        df = pd.DataFrame(data[1:], columns=data[0])
        frames.append(df)
    return frames
Given the following input,
wiki = """
=Title=

Description.

{| class="wikitable sortable"
|-
! Model !! Mhash/s !! Mhash/J !! Watts !! Clock !! SP !! Comment
|-
| ION || 1.8 || 0.067 || 27 || || 16 || poclbm; power consumption incl. CPU
|-
| 8200 mGPU || 1.2 || || || 1200 || 16 || 128 MB shared memory, "poclbm -w 128 -f 0"
|-
| 8400 GS || 2.3 || || || || || "poclbm -w 128"
|-
|}

{| class="wikitable sortable"
|-
! A !! B !! C
|-
| 0
| 1
| 2
|-
| 3
| 4
| 5
|}
"""
get_tables(wiki) returns the following DataFrames.
       Model  Mhash/s  Mhash/J  Watts  Clock  SP                                     Comment
0        ION      1.8    0.067     27         16         poclbm; power consumption incl. CPU
1  8200 mGPU      1.2                   1200  16  128 MB shared memory, "poclbm -w 128 -f 0"
2    8400 GS      2.3                                                        "poclbm -w 128"

   A  B  C
0  0  1  2
1  3  4  5
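Note that every cell in these frames is a string, because get_tables takes the values straight from the element text. A minimal sketch of one way to coerce them, assuming a hypothetical helper named coerce_cell (it is not part of pandas or py-wikimarkup) that turns numeric-looking text into int or float and empty cells into None:

```python
def coerce_cell(text):
    """Hypothetical helper: convert a cell's text to int, float, None, or str."""
    text = text.strip()
    if not text:
        return None              # empty cell
    for cast in (int, float):
        try:
            return cast(text)    # first conversion that succeeds wins
        except ValueError:
            pass
    return text                  # non-numeric text is left unchanged


# Two sample rows shaped like the first table above (Comment column omitted).
rows = [["ION", "1.8", "0.067", "27", "", "16"],
        ["8400 GS", "2.3", "", "", "", ""]]
typed = [[coerce_cell(cell) for cell in row] for row in rows]
```

The typed rows could then be passed to pd.DataFrame in place of data[1:].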
Use re to preprocess the markup, then use read_csv to convert it to a DataFrame:
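import re
from io import StringIO

import pandas as pd

table = """{| class="wikitable sortable"
|-
! Model !! Mhash/s !! Mhash/J !! Watts !! Clock !! SP !! Comment
|-
| ION || 1.8 || 0.067 || 27 || || 16 || poclbm; power consumption incl. CPU
|-
| 8200 mGPU || 1.2 || || || 1200 || 16 || 128 MB shared memory, "poclbm -w 128 -f 0"
|-
| 8400 GS || 2.3 || || || || || "poclbm -w 128"
|-
|}"""

# Drop the "|-" row separators, then strip the leading "| " or "! " from each line.
cleaned = re.sub(r"^\|.|^!.", "", table.replace("|-\n", ""), flags=re.MULTILINE)
# Regex delimiters are handled by the Python parsing engine; naming it explicitly
# avoids a ParserWarning. skiprows=1 skips the opening "{| class=..." line.
df = pd.read_csv(StringIO(cleaned), delimiter=r"\|\||!!", engine="python", skiprows=1)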
Output:

       Model  Mhash/s  Mhash/J  Watts  Clock  SP                                     Comment
0        ION      1.8    0.067     27         16         poclbm; power consumption incl. CPU
1  8200 mGPU      1.2                   1200  16  128 MB shared memory, "poclbm -w 128 -f 0"
2    8400 GS      2.3                                                        "poclbm -w 128"
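The re-based preprocessing above relies on every row keeping its cells on one line separated by ||, so it would not cope with tables that put one cell per line. As a rough sketch of a dependency-free alternative, the hypothetical parse_wikitable helper below (the name and approach are my own, not from either library) walks the markup line by line. It assumes a single simple table: no caption, no nested tables, no cell attributes, and no multi-line cell content.

```python
def parse_wikitable(text):
    """Hypothetical helper: return (header, rows) for one simple wikitable."""
    header, rows, current = [], [], None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("{|") or line.startswith("|}"):
            continue                      # blank line or table open/close marker
        if line.startswith("|-"):
            if current:                   # row separator: flush the open row
                rows.append(current)
            current = []
        elif line.startswith("!"):
            header += [c.strip() for c in line[1:].split("!!")]
        elif line.startswith("|"):
            if current is None:
                current = []
            # Works for "| a || b || c" as well as one "| a" cell per line.
            current += [c.strip() for c in line[1:].split("||")]
    if current:
        rows.append(current)              # last row if no trailing "|-"
    return header, rows


sample = """{| class="wikitable sortable"
|-
! A !! B !! C
|-
| 0 || 1 || 2
|-
| 3
| 4
| 5
|}"""
header, rows = parse_wikitable(sample)
```

pd.DataFrame(rows, columns=header) then builds the frame; as with the other approaches, the cells are still strings.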