
h5py: Correct way to slice array datasets


For fast slicing with h5py, stick to the "plain-vanilla" slice notation:

file['test'][0:300000]

or, for example, reading every other element:

file['test'][0:300000:2]

Simple slicing (slice objects and single integer indices) should be very fast, as it translates directly into HDF5 hyperslab selections.
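For example, a minimal sketch of plain-slice reads (the file name "example.h5" used here is an illustrative assumption, not taken from the post):

import numpy as np
import h5py

# create a sample dataset to slice
with h5py.File("example.h5", "w") as f:
    f.create_dataset("test", data=np.arange(1000000))

with h5py.File("example.h5", "r") as f:
    first = f["test"][0:300000]      # contiguous hyperslab read
    strided = f["test"][0:300000:2]  # strided hyperslab read
    print(first.shape, strided.shape)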

The expression file['test'][range(300000)] invokes h5py's version of "fancy indexing", namely, indexing via an explicit list of indices. There's no native way to do this in HDF5, so h5py implements a (slower) method in Python, which unfortunately has abysmal performance when the lists are > 1000 elements. Likewise for file['test'][np.arange(300000)], which is interpreted in the same way.

See also:

[1] http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing

[2] https://github.com/h5py/h5py/issues/293
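If you really do need an arbitrary list of indices, one common workaround is to read a covering slice with plain slicing and then apply the index list to the in-memory NumPy array. This is only a sketch under assumptions: the index array idx and the file/dataset names are made up for illustration.

import numpy as np
import h5py

# 5000 unique, increasing indices into the first 300000 elements
idx = np.sort(np.random.choice(300000, size=5000, replace=False))

with h5py.File("example.h5", "r") as f:
    # slow for long lists: h5py builds the point selection in Python
    # slow = f["test"][idx]
    # usually faster: one hyperslab read, then NumPy indexing in memory
    block = f["test"][idx[0]:idx[-1] + 1]
    fast = block[idx - idx[0]]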


The .value attribute copies the entire dataset into memory as a NumPy array. Try comparing type(file["test"]) with type(file["test"].value): the former should be an HDF5 dataset, the latter a NumPy array.
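A quick way to see the distinction (note that .value was removed in h5py 3.x; ds[...] or ds[()] forces the same full read; file and dataset names are assumed):

import h5py

with h5py.File("example.h5", "r") as f:
    ds = f["test"]
    print(type(ds))   # h5py Dataset: a handle to data on disk
    arr = ds[...]     # copies the whole dataset into memory
    print(type(arr))  # numpy.ndarray: the in-memory copy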

I'm not familiar enough with the h5py or HDF5 internals to tell you exactly why certain dataset operations are slow, but the reason those two cases differ is that in one you're slicing a NumPy array in memory, and in the other you're slicing an HDF5 dataset from disk.


Based on the title of your post, the 'correct' way to slice array datasets is to use the built-in slice notation.

All of your examples would be equivalent to file["test"][:].

[:] selects all elements in the array.

More information about slicing notation can be found here: http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html
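As a small sketch (file and dataset names assumed), reading everything with [:]:

import h5py

with h5py.File("example.h5", "r") as f:
    everything = f["test"][:]  # reads the full dataset into one NumPy array
    print(everything.shape)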

I use HDF5 + Python often, and I've never had to use the .value attribute. When you access a dataset with something like myarr = file["test"], you get an h5py dataset object that you can slice just like a NumPy array; the actual data is only read from disk when you slice it.
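For instance (file and dataset names assumed), the handle itself is cheap to create and data only moves when you slice:

import h5py

f = h5py.File("example.h5", "r")
myarr = f["test"]      # an h5py Dataset handle; no data read yet
chunk = myarr[0:1000]  # this slice triggers the actual disk read
f.close()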