Save pandas DataFrame using h5py for interoperability with other HDF5 readers
Here is my approach to solving this problem. I am hoping either someone else has a better solution or my approach is helpful to others.
First, define a function to make a numpy structured array (not a record array) from a pandas DataFrame.
    import numpy as np

    def df_to_sarray(df):
        """
        Convert a pandas DataFrame object to a numpy structured array.
        This is functionally equivalent to but more efficient than
        np.array(df.to_records())

        :param df: the data frame to convert
        :return: a numpy structured array representation of df
        """
        v = df.values
        cols = df.columns
        types = [(cols[i].encode(), df[k].dtype.type) for (i, k) in enumerate(cols)]
        dtype = np.dtype(types)
        z = np.zeros(v.shape[0], dtype)
        for (i, k) in enumerate(z.dtype.names):
            z[k] = v[:, i]
        return z
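For comparison, pandas also ships a built-in route to a record array, `DataFrame.to_records()`. A minimal sketch with a hypothetical two-row frame (the frame and its values are illustrative, not part of the original answer):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the frame used below: float columns with
# some NaNs, indexed by an integer ID.
df = pd.DataFrame(
    {'A': [np.nan, 0.1], 'B': [0.2, np.nan], 'C': [np.nan, 0.5]},
    index=pd.Index([1, 2], name='ID'),
)

# to_records(index=True) folds the index in as the first field,
# much like df_to_sarray(df.reset_index()) below.
ra = df.to_records(index=True)
print(ra.dtype.names)  # ('ID', 'A', 'B', 'C')
```

The result is a record array rather than a plain structured array, but the field layout is the same.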
Use reset_index to make a new data frame that includes the index as part of its data, then convert that data frame to a structured array.
    sa = df_to_sarray(df.reset_index())
    sa
    array([(1L, nan, 0.2, nan), (2L, nan, nan, 0.5), (3L, nan, 0.2, 0.5),
           (4L, 0.1, 0.2, nan), (5L, 0.1, 0.2, 0.5), (6L, 0.1, nan, 0.5),
           (7L, 0.1, nan, nan)],
          dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
Save that structured array to an HDF5 file.
    import h5py
    with h5py.File('mydata.h5', 'w') as hf:
        hf['df'] = sa
Load the h5 dataset
    with h5py.File('mydata.h5', 'r') as hf:
        sa2 = hf['df'][:]
Extract the ID column and delete it from sa2
    import numpy.lib.recfunctions as nprec
    ID = sa2['ID']
    sa2 = nprec.drop_fields(sa2, 'ID')
Make a data frame with index ID using sa2
    import pandas as pd
    df2 = pd.DataFrame(sa2, index=ID)
    df2.index.name = 'ID'
    print(df2)

          A    B    C
    ID
    1   NaN  0.2  NaN
    2   NaN  NaN  0.5
    3   NaN  0.2  0.5
    4   0.1  0.2  NaN
    5   0.1  0.2  0.5
    6   0.1  NaN  0.5
    7   0.1  NaN  NaN
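To sanity-check that this kind of round trip preserves the data, rebuild the frame and compare. A self-contained sketch (it uses `to_records` in place of the h5py save/load, purely for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with an ID index, like the one in this answer
df = pd.DataFrame({'A': [0.1, np.nan], 'B': [np.nan, 0.5]},
                  index=pd.Index([1, 2], name='ID'))

# Round trip: fold the index into the data, then restore it afterwards
sa = df.reset_index().to_records(index=False)
df2 = pd.DataFrame(sa).set_index('ID')

# equals() treats NaNs in matching positions as equal
print(df2.equals(df))  # True
```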
The pandas HDFStore format is standard HDF5, with just a convention for how to interpret the meta-data. Docs are here
    In [54]: df.to_hdf('test.h5', 'df', mode='w', format='table', data_columns=True)

    In [55]: h = h5py.File('test.h5')

    In [56]: h['df']['table']
    Out[56]: <HDF5 dataset "table": shape (7,), type "|V32">

    In [64]: h['df']['table'][:]
    Out[64]:
    array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
           (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
           (7, 0.1, nan, nan)],
          dtype=[('index', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

    In [57]: h['df']['table'].attrs.items()
    Out[57]:
    [(u'CLASS', 'TABLE'),
     (u'VERSION', '2.7'),
     (u'TITLE', ''),
     (u'FIELD_0_NAME', 'index'),
     (u'FIELD_1_NAME', 'A'),
     (u'FIELD_2_NAME', 'B'),
     (u'FIELD_3_NAME', 'C'),
     (u'FIELD_0_FILL', 0),
     (u'FIELD_1_FILL', 0.0),
     (u'FIELD_2_FILL', 0.0),
     (u'FIELD_3_FILL', 0.0),
     (u'index_kind', 'integer'),
     (u'A_kind', "(lp1\nS'A'\na."),
     (u'A_meta', 'N.'),
     (u'A_dtype', 'float64'),
     (u'B_kind', "(lp1\nS'B'\na."),
     (u'B_meta', 'N.'),
     (u'B_dtype', 'float64'),
     (u'C_kind', "(lp1\nS'C'\na."),
     (u'C_meta', 'N.'),
     (u'C_dtype', 'float64'),
     (u'NROWS', 7)]

    In [58]: h.close()
The data will be completely readable in any HDF5 reader. Some of the meta-data is pickled, so care must be taken.
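To see that the pandas-specific meaning lives entirely in the HDF5 attributes, you can write similar metadata by hand with h5py; a hedged sketch (the file and attribute values here are illustrative, not what pandas writes):

```python
import h5py
import numpy as np

# Write a dataset with a PyTables-style attribute, as pandas does
with h5py.File('attrs_demo.h5', 'w') as h:
    d = h.create_dataset('table', data=np.arange(3))
    d.attrs['CLASS'] = np.bytes_('TABLE')

# A generic HDF5 reader can list the attributes without having to
# understand (or unpickle) any of them
with h5py.File('attrs_demo.h5', 'r') as h:
    meta = dict(h['table'].attrs)
print(meta['CLASS'])  # b'TABLE'
```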
In case it is helpful for anyone, I took the post from Guillaume and Phil and changed it a bit for my needs, with the help of ankostis. We read the pandas DataFrame from a CSV file.
Mainly I adapted it for strings, because you cannot store an object dtype in an HDF5 file (I believe). First check which column types are numpy objects. Then find the longest string in each such column, and fix that column to be a string of that length. The rest is quite similar to the other post.
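The length-fixing idea can be sketched on its own: find the longest string in an object-dtype column and build a fixed-width `S<n>` dtype for it (the column name and values here are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical object-dtype column with variable-length strings
col = pd.Series(['ab', 'abcd', None, 'a'], name='name')

# Longest string, ignoring missing values
maxlen = int(col.dropna().str.len().max())

# Fixed-width bytes dtype that HDF5 can store directly
dtype = np.dtype([('name', 'S%d' % maxlen)])
z = np.zeros(len(col), dtype)
z['name'] = col.fillna('').str.encode('latin').to_numpy()
print(z['name'][1])  # b'abcd'
```

Missing values become empty byte strings here; the function below falls back to a float column instead when everything is missing.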
    import numpy as np

    def df_to_sarray(df):
        """
        Convert a pandas DataFrame object to a numpy structured array.
        Also, for every column of a str type, convert it into a 'bytes'
        str literal of length = max(len(col)).

        :param df: the data frame to convert
        :return: a numpy structured array representation of df
        """
        def make_col_type(col_type, col):
            try:
                if 'numpy.object_' in str(col_type.type):
                    maxlens = col.dropna().str.len()
                    if maxlens.any():
                        maxlen = maxlens.max().astype(int)
                        col_type = ('S%s' % maxlen, 1)
                    else:
                        col_type = 'f2'
                return col.name, col_type
            except:
                print(col.name, col_type, col_type.type, type(col))
                raise

        v = df.values
        types = df.dtypes
        numpy_struct_types = [make_col_type(types[col], df.loc[:, col])
                              for col in df.columns]
        dtype = np.dtype(numpy_struct_types)
        z = np.zeros(v.shape[0], dtype)
        for (i, k) in enumerate(z.dtype.names):
            # This is in case you have problems with the encoding;
            # remove the if branch if not needed
            try:
                if dtype[i].str.startswith('|S'):
                    z[k] = df[k].str.encode('latin').astype('S')
                else:
                    z[k] = v[:, i]
            except:
                print(k, v[:, i])
                raise
        return z, dtype
So the workflow would be:
    import h5py
    import pandas as pd

    # Read a CSV file
    # Here we assume col_dtypes is a dictionary that contains the dtypes of the columns
    df = pd.read_table('./data.csv', sep='\t', dtype=col_dtypes)

    # Transform the DataFrame into a structured numpy array and get the dtype
    sa, saType = df_to_sarray(df)

    # Open/create the HDF5 file
    f = h5py.File('test.hdf5', 'a')

    # Save the structured array
    f.create_dataset('someData', data=sa, dtype=saType)

    # Retrieve it and check it is ok when you transform it into a pandas DataFrame
    sa2 = f['someData'][:]
    df2 = pd.DataFrame(sa2)
    print(df2.head())

    f.close()
Also, in this way you are able to see it from HDFView even when using, for instance, gzip compression.
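A compressed variant of the workflow might look like the sketch below; gzip is a standard HDF5 filter, so generic tools such as HDFView can still read the file (the dataset, file name, and sample array are illustrative):

```python
import h5py
import numpy as np

# Hypothetical structured array standing in for `sa` above
sa = np.array([(1, 0.1), (2, 0.5)],
              dtype=[('ID', '<i8'), ('A', '<f8')])

with h5py.File('test_gzip.hdf5', 'w') as f:
    # compression='gzip' uses HDF5's built-in deflate filter
    f.create_dataset('someData', data=sa, compression='gzip')

with h5py.File('test_gzip.hdf5', 'r') as f:
    back = f['someData'][:]
print(back['ID'].tolist())  # [1, 2]
```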