Storing a list of strings to a HDF5 Dataset from Python


You're reading in Unicode strings, but specifying your datatype as ASCII. According to the h5py wiki, h5py does not currently support this conversion.

You'll need to encode the strings in a format h5py handles:

asciiList = [n.encode("ascii", "ignore") for n in strList]
h5File.create_dataset('xxx', (len(asciiList), 1), 'S10', asciiList)

Note: not everything that can be encoded in UTF-8 can be encoded in ASCII! With the "ignore" error handler, any character outside ASCII is silently dropped.
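To make the data-loss risk concrete, here is a minimal sketch using only the standard library. The sample strings and the names `str_list`, `ascii_list` and `lossy` are made up for illustration; they are not from the original question.

```python
# Hypothetical sample data: one pure-ASCII string and two with accents.
str_list = ["plain", "café", "naïve résumé"]

# Same encoding call as in the answer above: non-ASCII characters are dropped.
ascii_list = [s.encode("ascii", "ignore") for s in str_list]
print(ascii_list)  # [b'plain', b'caf', b'nave rsum']

# Collect the strings that did not survive the conversion unchanged.
lossy = [s for s, b in zip(str_list, ascii_list) if b.decode("ascii") != s]
print(lossy)  # ['café', 'naïve résumé']
```

Running a check like this before writing makes it obvious whether the ASCII route is acceptable for a given list.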


From https://docs.h5py.org/en/stable/special.html:

In HDF5, data in VL format is stored as arbitrary-length vectors of a base type. In particular, strings are stored C-style in null-terminated buffers. NumPy has no native mechanism to support this. Unfortunately, this is the de facto standard for representing strings in the HDF5 C API, and in many HDF5 applications.

Thankfully, NumPy has a generic pointer type in the form of the “object” (“O”) dtype. In h5py, variable-length strings are mapped to object arrays. A small amount of metadata attached to an “O” dtype tells h5py that its contents should be converted to VL strings when stored in the file.

Existing VL strings can be read and written to with no additional effort; Python strings and fixed-length NumPy strings can be auto-converted to VL data and stored.

Example

In [27]: dt = h5py.special_dtype(vlen=str)
In [28]: dset = h5File.create_dataset('vlen_str', (100,), dtype=dt)
In [29]: dset[0] = 'the change of water into water vapour'
In [30]: dset[0]
Out[30]: 'the change of water into water vapour'
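The same idea can be applied to a whole list of Unicode strings in one call. The sketch below assumes a reasonably recent h5py that provides `h5py.string_dtype` (the newer spelling of `special_dtype(vlen=str)`). It uses an in-memory file so nothing is written to disk, and the file name, dataset name and sample strings are placeholders.

```python
import h5py

# Hypothetical sample data, including non-ASCII characters.
str_list = ["alpha", "β-decay", "naïve"]

# driver="core" with backing_store=False keeps the file purely in memory.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    dt = h5py.string_dtype(encoding="utf-8")  # variable-length UTF-8 strings
    dset = f.create_dataset("names", data=str_list, dtype=dt)

    # Depending on the h5py version, elements come back as bytes or str,
    # so decode only when needed.
    raw = dset[...]
    round_trip = [s.decode("utf-8") if isinstance(s, bytes) else s for s in raw]

print(round_trip)
```

Because the dataset is variable-length UTF-8, no characters are dropped and no maximum length such as 'S10' has to be chosen up front.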


I am in a similar situation: I want to store the column names of a DataFrame as a dataset in an HDF5 file. Assuming df.columns is what I want to store, I found that the following works:

h5File = h5py.File('my_file.h5', 'w')
h5File['col_names'] = df.columns.values.astype('S')

This assumes the column names are 'simple' strings that can be encoded in ASCII.
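If the column names might contain non-ASCII characters, one possible alternative is to encode them to UTF-8 explicitly before building the fixed-length byte array. This is only a sketch: `col_names` is a hypothetical stand-in for `df.columns.values`, and the names in it are invented.

```python
import numpy as np

# Hypothetical stand-in for df.columns.values (an object array of str).
col_names = np.array(["price", "qty", "café"], dtype=object)

# Encode each name to UTF-8 bytes; NumPy picks a fixed width that fits the longest.
encoded = np.array([name.encode("utf-8") for name in col_names])
print(encoded.dtype)  # |S5 ('café' takes 5 bytes in UTF-8)

# After reading the dataset back, decode the bytes to recover the names.
decoded = [b.decode("utf-8") for b in encoded]
print(decoded)
```

An array like `encoded` can be assigned to the file in the same way as the `astype('S')` result above; whoever reads it back has to know that the bytes are UTF-8.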