
Shuffling multiple HDF5 datasets in-place


Shuffling arrays like this in numpy is straightforward.

Create the large shuffling index (shuffle np.arange(1000000)) and index the arrays:

features = features[I, ...]
labels = labels[I]
info = info[I, :]

This isn't an in-place operation: labels[I] is a copy of labels, not a slice or view.
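You can verify the copy with np.shares_memory (a quick check, reusing the same array names):

import numpy as np

labels = np.arange(10)
I = np.random.permutation(10)

# Fancy indexing materializes a new array; it never aliases the original.
print(np.shares_memory(labels, labels[I]))  # False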

An alternative

features[I, ...] = features

looks on the surface like an in-place operation, but I doubt that it is down in the C code. The assignment has to be buffered, because the I values are not guaranteed to be unique. In fact, there is a special ufunc .at method precisely for unbuffered operations.
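For example, with duplicate indices the buffered form loses updates, while the unbuffered np.add.at applies every one (a small illustration, not from the original question):

import numpy as np

a = np.zeros(3)
I = np.array([0, 0, 1])

a[I] += 1           # buffered: the duplicate index 0 only takes effect once
print(a)            # [1. 1. 0.]

a = np.zeros(3)
np.add.at(a, I, 1)  # unbuffered: every occurrence is applied
print(a)            # [2. 1. 0.]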

But look at what h5py says about this same sort of 'fancy indexing':

http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing

labels[I] selection is implemented, but with restrictions.

- List selections may not be empty
- Selection coordinates must be given in increasing order
- Duplicate selections are ignored
- Very long lists (> 1000 elements) may produce poor performance

Your shuffled I is, by definition, not in increasing order. And it is very long.

Also, I don't see anything about using this fancy indexing on the left-hand side, labels[I] = ....
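One way to live with those restrictions is to read with sorted coordinates and undo the sort in memory. A minimal sketch (fancy_read is a hypothetical helper; it assumes the indices are unique, since h5py ignores duplicates):

import numpy as np

def fancy_read(dset, idx):
    # Emulate dset[idx] for an unsorted index array: read in increasing
    # order, which h5py accepts, then restore the requested order in memory.
    # Assumes idx has no duplicates (h5py would silently drop them).
    idx = np.asarray(idx)
    order = np.argsort(idx)
    buf = dset[idx[order]]    # increasing coordinates: a valid h5py selection
    out = np.empty_like(buf)
    out[order] = buf          # undo the sort
    return out

The warning about very long lists still applies, so for a full permutation it is usually cheaper to read the whole dataset (or contiguous chunks of it) and shuffle in memory.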


Shuffling arrays on disk will be time-consuming, as it means that you have to allocate new arrays in the HDF5 file and then copy all the rows in a different order. You can iterate over rows (or use chunks of rows) with PyTables or h5py if you want to avoid loading all the data into memory at once, as in the sketch below.
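For instance, a chunked copy into a new dataset might look like this (a sketch with h5py; the file and dataset names are placeholders):

import numpy as np
import h5py

chunk = 10000
with h5py.File('data.h5', 'r+') as f:
    src = f['features']
    n = src.shape[0]
    perm = np.random.permutation(n)
    dst = f.create_dataset('features_shuffled', shape=src.shape,
                           dtype=src.dtype)
    for start in range(0, n, chunk):
        idx = perm[start:start + chunk]
        order = np.argsort(idx)
        # Read the batch with increasing coordinates (h5py requires this),
        # then restore the shuffled order before writing a contiguous slice.
        buf = src[idx[order]]
        dst[start:start + len(idx)] = buf[np.argsort(order)]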

An alternative approach could be to keep your data as it is and simply map new row numbers to old row numbers in a separate array (which you can keep fully loaded in RAM, since it will be only 4 MB with your array sizes). For instance, to shuffle a numpy array x,

x = np.random.rand(5)
idx_map = np.arange(x.shape[0])
np.random.shuffle(idx_map)

Then you can use advanced numpy indexing to access your shuffled data,

x[idx_map[2]]  # equivalent to x_shuffled[2]
x[idx_map]     # equivalent to x_shuffled[:], etc.

This will also work with arrays saved to HDF5. Of course, there would be some overhead compared to writing shuffled arrays to disk, but it could be sufficient depending on your use case.
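For example, reading a shuffled row or minibatch from an h5py dataset through the index map (a sketch; the file and dataset names are placeholders):

import numpy as np
import h5py

with h5py.File('data.h5', 'r') as f:
    x = f['x']
    idx_map = np.random.permutation(x.shape[0])

    row = x[idx_map[2]]  # single rows need no special care

    # For a batch, h5py wants increasing coordinates, so sort first.
    # The within-batch order changes, which rarely matters for a
    # batch that is random anyway.
    batch = x[np.sort(idx_map[10:20])]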


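Another option is to write the shuffled datasets to a new file, reading each dataset once and reordering it with np.take: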
import numpy as np
import h5py

with h5py.File('original.h5py', 'r') as data, \
     h5py.File('output.h5py', 'w') as out:
    # Build one permutation and reuse it for every dataset,
    # so rows stay aligned across all of them.
    indexes = np.arange(data['some_dataset_in_original'].shape[0])
    np.random.shuffle(indexes)
    for key in data.keys():
        print(key)
        # np.take reads the whole dataset into memory in shuffled order
        feed = np.take(data[key], indexes, axis=0)
        out.create_dataset(key, data=feed)