How to convert OpenDocument spreadsheets to a pandas DataFrame? How to convert OpenDocument spreadsheets to a pandas DataFrame? python python

How to convert OpenDocument spreadsheets to a pandas DataFrame?


This is available natively in pandas 0.25. So long as you have odfpy installed (conda install odfpy OR pip install odfpy) you can do

pd.read_excel("the_document.ods", engine="odf")


You can read ODF (Open Document Format .ods) documents in Python using the following modules:

Using ezodf, a simple ODS-to-DataFrame converter could look like this:

import pandas as pdimport ezodfdoc = ezodf.opendoc('some_odf_spreadsheet.ods')print("Spreadsheet contains %d sheet(s)." % len(doc.sheets))for sheet in doc.sheets:    print("-"*40)    print("   Sheet name : '%s'" % sheet.name)    print("Size of Sheet : (rows=%d, cols=%d)" % (sheet.nrows(), sheet.ncols()) )# convert the first sheet to a pandas.DataFramesheet = doc.sheets[0]df_dict = {}for i, row in enumerate(sheet.rows()):    # row is a list of cells    # assume the header is on the first row    if i == 0:        # columns as lists in a dictionary        df_dict = {cell.value:[] for cell in row}        # create index for the column headers        col_index = {j:cell.value for j, cell in enumerate(row)}        continue    for j, cell in enumerate(row):        # use header instead of column index        df_dict[col_index[j]].append(cell.value)# and convert to a DataFramedf = pd.DataFrame(df_dict)

P.S.

  • ODF spreadsheet (*.ods files) support has been requested on the pandas issue tracker: https://github.com/pydata/pandas/issues/2311, but it is still not implemented.

  • ezodf was used in the unfinished PR9070 to implement ODF support in pandas. That PR is now closed (read the PR for a technical discussion), but it is still available as an experimental feature in this pandas fork.

  • there are also some brute force methods to read directly from the XML code (here)


Here is a quick and dirty hack which uses ezodf module:

import pandas as pdimport ezodfdef read_ods(filename, sheet_no=0, header=0):    tab = ezodf.opendoc(filename=filename).sheets[sheet_no]    return pd.DataFrame({col[header].value:[x.value for x in col[header+1:]]                         for col in tab.columns()})

Test:

In [92]: df = read_ods(filename='fn.ods')In [93]: dfOut[93]:     a    b    c0  1.0  2.0  3.01  4.0  5.0  6.02  7.0  8.0  9.0

NOTES:

  • all other useful parameters like header, skiprows, index_col, parse_cols are NOT implemented in this function - feel free to update this question if you want to implement them
  • ezodf depends on lxml make sure you have it installed