Read HDF5 file into numpy array
The easiest thing is to use the .value attribute of the HDF5 dataset.
>>> import h5py
>>> import numpy as np
>>> hf = h5py.File('/path/to/file', 'r')
>>> data = hf.get('dataset_name').value # `data` is now an ndarray.
You can also slice the dataset, which produces an actual ndarray with the requested data:
>>> hf['dataset_name'][:10] # produces ndarray as well
But keep in mind that in many ways the h5py dataset acts like an ndarray, so you can pass the dataset itself unchanged to most, if not all, NumPy functions. For example, this works just fine: np.mean(hf.get('dataset_name')).
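As a minimal sketch of slicing and of passing a dataset straight to NumPy: the file name `demo.h5`, the sample data, and the in-memory `core` driver are assumptions made only so the example is self-contained.

```python
import h5py
import numpy as np

# Hypothetical in-memory file, so the sketch leaves nothing on disk.
with h5py.File('demo.h5', 'w', driver='core', backing_store=False) as hf:
    dset = hf.create_dataset('dataset_name', data=np.arange(20, dtype='f8'))

    first_ten = dset[:10]       # slicing reads only the requested elements into an ndarray
    everything = dset[()]       # an empty-tuple index reads the whole dataset
    mean_value = np.mean(dset)  # NumPy converts the dataset to an array internally
```

Slicing is usually the better choice for large files, since only the selected elements are read from disk.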
EDIT:
I misunderstood the question originally. The problem isn't loading the numerical data, it's that the dataset actually contains HDF5 references. This is a strange setup, and it's kind of awkward to read in h5py. You need to dereference each reference in the dataset. I'll show it for just one of them.
First, let's create a file and a temporary dataset:
>>> f = h5py.File('tmp.h5', 'w')
>>> ds = f.create_dataset('data', data=np.zeros(10,))
Next, create a reference to it and store a few of them in a dataset.
>>> ref_dtype = h5py.special_dtype(ref=h5py.Reference)
>>> ref_ds = f.create_dataset('data_refs', data=(ds.ref, ds.ref), dtype=ref_dtype)
Then you can read one of these back, in a circuitous way, by getting its name and then reading from the actual dataset that is referenced.
>>> name = h5py.h5r.get_name(ref_ds[0], f.id) # 2nd argument is the file identifier
>>> print(name)
b'/data'
>>> out = f[name]
>>> print(out.shape)
(10,)
It's roundabout, but it seems to work. The TL;DR is: get the name of the referenced dataset, and read directly from that.
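To cover every reference rather than just one, the same name-lookup approach can be put in a loop. This is only a sketch: the dataset names `a` and `b`, the file name, and the in-memory `core` driver are made up for illustration, and the reference dataset is filled element by element rather than in one call.

```python
import h5py
import numpy as np

ref_dtype = h5py.special_dtype(ref=h5py.Reference)

# Hypothetical in-memory file with two target datasets and a dataset of references.
with h5py.File('refs_demo.h5', 'w', driver='core', backing_store=False) as f:
    a = f.create_dataset('a', data=np.zeros(10))
    b = f.create_dataset('b', data=np.ones(5))
    ref_ds = f.create_dataset('data_refs', (2,), dtype=ref_dtype)
    ref_ds[0] = a.ref
    ref_ds[1] = b.ref

    arrays = {}
    for ref in ref_ds[()]:
        # Look up the name of the referenced object, then read from it.
        name = h5py.h5r.get_name(ref, f.id)
        if isinstance(name, bytes):
            name = name.decode('utf-8')
        arrays[name] = f[name][()]  # read the referenced dataset into an ndarray
```

The h5py documentation also describes dereferencing by indexing the file with the reference itself (`f[ref]`), which may be worth trying before dropping down to the low-level `h5r` functions.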
Note:
The h5py.h5r.dereference function seems pretty unhelpful here, despite the name. It returns the ID of the referenced object. This can be read from directly, but it's very easy to cause a crash in this case (I did it several times in this contrived example here). Getting the name and reading from that is much easier.
Note 2:
As stated in the release notes for h5py 2.1, the use of the Dataset.value property is deprecated and should be replaced by using mydataset[...] or mydataset[()] as appropriate.
The property Dataset.value, which dates back to h5py 1.0, is deprecated and will be removed in a later release. This property dumps the entire dataset into a NumPy array. Code using .value should be updated to use NumPy indexing, using mydataset[...] or mydataset[()] as appropriate.
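A small sketch of the two replacements for `.value`, under the assumption of a made-up in-memory file with one array dataset and one scalar dataset. The two forms behave the same for array datasets; they differ for scalar datasets, where `[()]` yields a scalar and `[...]` yields a zero-dimensional array.

```python
import h5py
import numpy as np

# Hypothetical in-memory file holding an array dataset and a scalar dataset.
with h5py.File('value_demo.h5', 'w', driver='core', backing_store=False) as hf:
    hf.create_dataset('matrix', data=np.arange(6).reshape(2, 3))
    hf.create_dataset('scalar', data=3.5)

    matrix = hf['matrix'][()]        # whole dataset as an ndarray
    matrix_too = hf['matrix'][...]   # same result for a non-scalar dataset
    scalar_value = hf['scalar'][()]  # scalar dataset read as a NumPy scalar
    scalar_array = hf['scalar'][...] # scalar dataset read as a 0-d ndarray
```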
h5py provides a built-in method for such tasks: read_direct()
import h5py
import numpy as np
hf = h5py.File('path/to/file', 'r')
n1 = np.zeros(shape, dtype=numpy_type) # shape and numpy_type must match the dataset
hf['dataset_name'].read_direct(n1)
hf.close()
The combined steps are still faster than n1 = np.array(hf['dataset_name']) if you %timeit them. The only drawback is that one needs to know the shape of the dataset beforehand, which can be assigned as an attribute by the data provider.
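One way to sidestep hard-coding the shape is to take it from the dataset object itself, since `shape` and `dtype` are available without reading the data. The sketch below assumes a made-up in-memory file and sample data, and makes no claim about timing.

```python
import h5py
import numpy as np

# Hypothetical in-memory file standing in for a real one on disk.
with h5py.File('direct_demo.h5', 'w', driver='core', backing_store=False) as hf:
    hf.create_dataset('dataset_name', data=np.arange(12, dtype='f4').reshape(3, 4))

    dset = hf['dataset_name']
    # Preallocate a destination array using the dataset's own shape and dtype.
    n1 = np.empty(dset.shape, dtype=dset.dtype)
    dset.read_direct(n1)  # fills n1 in place
```

Using the `with` block also closes the file automatically, so no explicit `hf.close()` is needed.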