
Shuffling multiple HDF5 datasets in-place


Shuffling arrays like this in numpy is straightforward.

Create the large shuffling index (shuffle np.arange(1000000)) and index the arrays:

features = features[I, ...]
labels = labels[I]
info = info[I, :]

This isn't an in-place operation: labels[I] is a copy of labels, not a slice or view.
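You can verify the copy with np.shares_memory (a quick check, reusing the same array names):

import numpy as np

labels = np.arange(10)
I = np.random.permutation(10)

# Fancy indexing materializes a new array; it never aliases the original.
print(np.shares_memory(labels, labels[I]))  # False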

An alternative

features[I, ...] = features

looks on the surface like an in-place operation, but I doubt that it is down in the C code. The assignment has to be buffered, because the I values are not guaranteed to be unique. In fact, there is a special ufunc .at method precisely for unbuffered operations.
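For example, with duplicate indices the buffered form loses updates, while the unbuffered np.add.at applies every one (a small illustration, not from the original question):

import numpy as np

a = np.zeros(3)
I = np.array([0, 0, 1])

a[I] += 1           # buffered: the duplicate index 0 only takes effect once
print(a)            # [1. 1. 0.]

a = np.zeros(3)
np.add.at(a, I, 1)  # unbuffered: every occurrence is applied
print(a)            # [2. 1. 0.]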

But look at what h5py says about this same sort of 'fancy indexing':

http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing

labels[I] selection is implemented, but with restrictions.

- List selections may not be empty
- Selection coordinates must be given in increasing order
- Duplicate selections are ignored
- Very long lists (> 1000 elements) may produce poor performance

Your shuffled I is, by definition, not in increasing order. And it is very long.

Also, I don't see anything about using this fancy indexing on the left-hand side, labels[I] = ....
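One way to live with those restrictions is to read with sorted coordinates and undo the sort in memory. A minimal sketch (fancy_read is a hypothetical helper; it assumes the indices are unique, since h5py ignores duplicates):

import numpy as np

def fancy_read(dset, idx):
    # Emulate dset[idx] for an unsorted index array: read in increasing
    # order, which h5py accepts, then restore the requested order in memory.
    # Assumes idx has no duplicates (h5py would silently drop them).
    idx = np.asarray(idx)
    order = np.argsort(idx)
    buf = dset[idx[order]]    # increasing coordinates: a valid h5py selection
    out = np.empty_like(buf)
    out[order] = buf          # undo the sort
    return out

The warning about very long lists still applies, so for a full permutation it is usually cheaper to read the whole dataset (or contiguous chunks of it) and shuffle in memory.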


Shuffling arrays on disk will be time-consuming, as it means that you have to allocate new arrays in the HDF5 file and then copy all the rows in a different order. You can iterate over rows (or use chunks of rows) with PyTables or h5py if you want to avoid loading all the data into memory at once, as in the sketch below.
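For instance, a chunked copy into a new dataset might look like this (a sketch with h5py; the file and dataset names are placeholders):

import numpy as np
import h5py

chunk = 10000
with h5py.File('data.h5', 'r+') as f:
    src = f['features']
    n = src.shape[0]
    perm = np.random.permutation(n)
    dst = f.create_dataset('features_shuffled', shape=src.shape,
                           dtype=src.dtype)
    for start in range(0, n, chunk):
        idx = perm[start:start + chunk]
        order = np.argsort(idx)
        # Read the batch with increasing coordinates (h5py requires this),
        # then restore the shuffled order before writing a contiguous slice.
        buf = src[idx[order]]
        dst[start:start + len(idx)] = buf[np.argsort(order)]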

An alternative approach could be to keep your data as it is and simply map new row numbers to old row numbers in a separate array (which you can keep fully loaded in RAM, since it will be only 4 MB with your array sizes). For instance, to shuffle a numpy array x,

x = np.random.rand(5)
idx_map = np.arange(x.shape[0])
np.random.shuffle(idx_map)

Then you can use advanced numpy indexing to access your shuffled data,

x[idx_map[2]]  # equivalent to x_shuffled[2]
x[idx_map]     # equivalent to x_shuffled[:], etc.

This will also work with arrays saved to HDF5. Of course, there would be some overhead compared to writing shuffled arrays to disk, but it could be sufficient depending on your use case.
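For example, reading a shuffled row or minibatch from an h5py dataset through the index map (a sketch; the file and dataset names are placeholders):

import numpy as np
import h5py

with h5py.File('data.h5', 'r') as f:
    x = f['x']
    idx_map = np.random.permutation(x.shape[0])

    row = x[idx_map[2]]  # single rows need no special care

    # For a batch, h5py wants increasing coordinates, so sort first.
    # The within-batch order changes, which rarely matters for a
    # batch that is random anyway.
    batch = x[np.sort(idx_map[10:20])]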


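Another option is to write the shuffled datasets to a new file, reading each dataset once and reordering it with np.take: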
import numpy as np
import h5py

with h5py.File('original.h5py', 'r') as data, \
     h5py.File('output.h5py', 'w') as out:
    # Build one permutation and reuse it for every dataset,
    # so rows stay aligned across all of them.
    indexes = np.arange(data['some_dataset_in_original'].shape[0])
    np.random.shuffle(indexes)
    for key in data.keys():
        print(key)
        # np.take reads the whole dataset into memory in shuffled order
        feed = np.take(data[key], indexes, axis=0)
        out.create_dataset(key, data=feed)