Save pandas DataFrame using h5py for interoperabilty with other hdf5 readers Save pandas DataFrame using h5py for interoperabilty with other hdf5 readers numpy numpy

Save pandas DataFrame using h5py for interoperabilty with other hdf5 readers


Here is my approach to solving this problem. I am hoping either someone else has a better solution or my approach is helpful to others.

First, define function to make a numpy structure array (not a record array) from a pandas DataFrame.

import numpy as npdef df_to_sarray(df):    """    Convert a pandas DataFrame object to a numpy structured array.    This is functionally equivalent to but more efficient than    np.array(df.to_array())    :param df: the data frame to convert    :return: a numpy structured array representation of df    """    v = df.values    cols = df.columns    types = [(cols[i].encode(), df[k].dtype.type) for (i, k) in enumerate(cols)]    dtype = np.dtype(types)    z = np.zeros(v.shape[0], dtype)    for (i, k) in enumerate(z.dtype.names):        z[k] = v[:, i]    return z

Use reset_index to make a new data frame that includes the index as part of its data. Convert that data frame to a structure array.

sa = df_to_sarray(df.reset_index())saarray([(1L, nan, 0.2, nan), (2L, nan, nan, 0.5), (3L, nan, 0.2, 0.5),       (4L, 0.1, 0.2, nan), (5L, 0.1, 0.2, 0.5), (6L, 0.1, nan, 0.5),       (7L, 0.1, nan, nan)],       dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

Save that structured array to an hdf5 file.

import h5pywith h5py.File('mydata.h5', 'w') as hf:            hf['df'] = sa

Load the h5 dataset

with h5py.File('mydata.h5') as hf:            sa2 = hf['df'][:]

Extract the ID column and delete it from sa2

ID = sa2['ID']sa2 = nprec.drop_fields(sa2, 'ID')

Make data frame with index ID using sa2

df2 = pd.DataFrame(sa2, index=ID)df2.index.name = 'ID'print(df2)      A    B    CID               1   NaN  0.2  NaN2   NaN  NaN  0.53   NaN  0.2  0.54   0.1  0.2  NaN5   0.1  0.2  0.56   0.1  NaN  0.57   0.1  NaN  NaN


The pandas HDFStore format is standard HDF5, with just a convention for how to interpret the meta-data. Docs are here

In [54]: df.to_hdf('test.h5','df',mode='w',format='table',data_columns=True)In [55]: h = h5py.File('test.h5')In [56]: h['df']['table']Out[56]: <HDF5 dataset "table": shape (7,), type "|V32">In [64]: h['df']['table'][:]Out[64]: array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),       (7, 0.1, nan, nan)],       dtype=[('index', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])In [57]: h['df']['table'].attrs.items()Out[57]: [(u'CLASS', 'TABLE'), (u'VERSION', '2.7'), (u'TITLE', ''), (u'FIELD_0_NAME', 'index'), (u'FIELD_1_NAME', 'A'), (u'FIELD_2_NAME', 'B'), (u'FIELD_3_NAME', 'C'), (u'FIELD_0_FILL', 0), (u'FIELD_1_FILL', 0.0), (u'FIELD_2_FILL', 0.0), (u'FIELD_3_FILL', 0.0), (u'index_kind', 'integer'), (u'A_kind', "(lp1\nS'A'\na."), (u'A_meta', 'N.'), (u'A_dtype', 'float64'), (u'B_kind', "(lp1\nS'B'\na."), (u'B_meta', 'N.'), (u'B_dtype', 'float64'), (u'C_kind', "(lp1\nS'C'\na."), (u'C_meta', 'N.'), (u'C_dtype', 'float64'), (u'NROWS', 7)]In [58]: h.close()

The data will be completely readable in any HDF5 reader. Some of the meta-data is pickled, so care must be taken.


In case it is helpful for anyone, I took this post from Guillaume and Phil, and changed it a bit for my needs with the help of ankostis. We read the pandas DataFrame from a CSV file.

Mainly I adapted it for Strings, because you cannot store a object in a HDF5 file (I believe). Firstly check which columns types are numpy objects. Then check which is the longest length of that column, and fix that column to be a String of that length. The rest is quite similar to the other post.

def df_to_sarray(df):    """    Convert a pandas DataFrame object to a numpy structured array.    Also, for every column of a str type, convert it into     a 'bytes' str literal of length = max(len(col)).    :param df: the data frame to convert    :return: a numpy structured array representation of df    """    def make_col_type(col_type, col):        try:            if 'numpy.object_' in str(col_type.type):                maxlens = col.dropna().str.len()                if maxlens.any():                    maxlen = maxlens.max().astype(int)                     col_type = ('S%s' % maxlen, 1)                else:                    col_type = 'f2'            return col.name, col_type        except:            print(col.name, col_type, col_type.type, type(col))            raise    v = df.values                types = df.dtypes    numpy_struct_types = [make_col_type(types[col], df.loc[:, col]) for col in df.columns]    dtype = np.dtype(numpy_struct_types)    z = np.zeros(v.shape[0], dtype)    for (i, k) in enumerate(z.dtype.names):        # This is in case you have problems with the encoding, remove the if branch if not        try:            if dtype[i].str.startswith('|S'):                z[k] = df[k].str.encode('latin').astype('S')            else:                z[k] = v[:, i]        except:            print(k, v[:, i])            raise    return z, dtype

So the workflow would be:

import h5pyimport pandas as pd# Read a CSV file# Here we assume col_dtypes is a dictionary that contains the dtypes of the columnsdf = pd.read_table('./data.csv', sep='\t', dtype=col_dtypes)# Transform the DataFrame into a structured numpy array and get the dtypesa, saType = df_to_sarray(df)# Open/create the HDF5 filef = h5py.File('test.hdf5', 'a')# Save the structured arrayf.create_dataset('someData', data=sa, dtype=saType)# Retrieve it and check it is ok when you transform it into a pandas DataFramesa2 = f['someData'][:]df2 = pd.DataFrame(sa2)print(df2.head())f.close()

Also, in this way you are able to see it from HDFView even when using gzip compression for instance.