
Numpy shuffle multidimensional array by row only, keep column order unchanged


You can use numpy.random.shuffle().

This function only shuffles the array along the first axis of a multi-dimensional array. The order of the sub-arrays is changed, but their contents remain the same.

In [2]: import numpy as np

In [3]: X = np.random.random((6, 2))

In [4]: X
Out[4]:
array([[0.71935047, 0.25796155],
       [0.4621708 , 0.55140423],
       [0.22605866, 0.61581771],
       [0.47264172, 0.79307633],
       [0.22701656, 0.11927993],
       [0.20117207, 0.2754544 ]])

In [5]: np.random.shuffle(X)

In [6]: X
Out[6]:
array([[0.71935047, 0.25796155],
       [0.47264172, 0.79307633],
       [0.4621708 , 0.55140423],
       [0.22701656, 0.11927993],
       [0.20117207, 0.2754544 ],
       [0.22605866, 0.61581771]])
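To make the "contents remain the same" part easier to check by eye, here is a small sketch with integer data instead of random floats. The array contents and the seed are my own choices for illustration, not part of the run above.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only so the illustration is repeatable

# Rows are (0, 1), (2, 3), ... so a broken row would be easy to spot.
X = np.arange(12).reshape(6, 2)
original_rows = {tuple(row) for row in X}

np.random.shuffle(X)  # shuffles in place along axis 0 and returns None

shuffled_rows = {tuple(row) for row in X}
print(shuffled_rows == original_rows)    # same set of rows, new order
print(np.all(X[:, 1] - X[:, 0] == 1))    # every row is still (2k, 2k + 1)
```

Both checks should print True, since only the order of the rows changes.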

For other functionality, you can also check out the following function:

The function random.Generator.permuted was introduced in NumPy's 1.20.0 release.

The new function differs from shuffle and permutation in that the subarrays indexed by an axis are permuted rather than the axis being treated as a separate 1-D array for every combination of the other indexes. For example, it is now possible to permute the rows or columns of a 2-D array.
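As I read the NumPy documentation, Generator.permuted shuffles each slice along the given axis independently of the others, so it does not keep rows together. For the row-only shuffle asked about here, Generator.permutation (returns a copy) or Generator.shuffle (in place) looks like the closer match. A minimal sketch of the difference, with an arbitrary seed and a small integer array of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for repeatability only
X = np.arange(12).reshape(6, 2)
original = X.copy()

# Whole rows move together: a row-only shuffle, returned as a new array.
rows_shuffled = rng.permutation(X, axis=0)

# Each column is permuted on its own, so rows are generally broken up.
cols_independent = rng.permuted(X, axis=0)

# In-place variant of the row-only shuffle.
rng.shuffle(X, axis=0)

print(rows_shuffled)
print(cols_independent)
print(X)
```

In rows_shuffled and X every row is still one of the original (2k, 2k + 1) pairs, while cols_independent only guarantees that each column holds the same values as before.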


You can also use np.random.permutation to generate a random permutation of the row indices and then index into the rows of X using np.take with axis=0. np.take can also write the result back into the input array X itself via the out= option, which avoids keeping a second full-size result array around. Thus, the implementation would look like this -

np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)

Sample run -

In [23]: X
Out[23]:
array([[ 0.60511059,  0.75001599],
       [ 0.30968339,  0.09162172],
       [ 0.14673218,  0.09089028],
       [ 0.31663128,  0.10000309],
       [ 0.0957233 ,  0.96210485],
       [ 0.56843186,  0.36654023]])

In [24]: np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X);

In [25]: X
Out[25]:
array([[ 0.14673218,  0.09089028],
       [ 0.31663128,  0.10000309],
       [ 0.30968339,  0.09162172],
       [ 0.56843186,  0.36654023],
       [ 0.0957233 ,  0.96210485],
       [ 0.60511059,  0.75001599]])

Additional performance boost

Here's a trick to speed up np.random.permutation(X.shape[0]) with np.argsort() -

np.random.rand(X.shape[0]).argsort()
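The idea is that the sort order of n random floats is itself a random ordering of 0..n-1. A quick sanity check that the result really is a valid permutation (the row count n below is a hypothetical small value, picked so the output can be inspected):

```python
import numpy as np

n = 10  # hypothetical row count, small enough to inspect by eye
idx = np.random.rand(n).argsort()

print(idx)
# A valid permutation contains each of 0..n-1 exactly once.
print(np.array_equal(np.sort(idx), np.arange(n)))
```

Note that argsort is a sort, so its cost grows as O(n log n); whether it beats np.random.permutation probably depends on the NumPy version and on n.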

Speedup results -

In [32]: X = np.random.random((6000, 2000))

In [33]: %timeit np.random.permutation(X.shape[0])
1000 loops, best of 3: 510 µs per loop

In [34]: %timeit np.random.rand(X.shape[0]).argsort()
1000 loops, best of 3: 297 µs per loop

Thus, the shuffling solution could be modified to -

np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)

Runtime tests -

These tests include the two approaches listed in this post and the np.random.shuffle based one from @Kasramvd's solution.

In [40]: X = np.random.random((6000, 2000))

In [41]: %timeit np.random.shuffle(X)
10 loops, best of 3: 25.2 ms per loop

In [42]: %timeit np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)
10 loops, best of 3: 53.3 ms per loop

In [43]: %timeit np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)
10 loops, best of 3: 53.2 ms per loop

So, it seems the np.take based approaches are worth using only if memory is a concern; otherwise, the np.random.shuffle based solution looks like the way to go.
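Timings like these depend heavily on the machine and the NumPy version, so it may be worth re-running them. Here is a rough sketch that repeats the comparison outside IPython with the standard timeit module. The smaller array size and the repeat counts are my own assumptions, chosen so the script finishes quickly.

```python
import timeit

import numpy as np

# Smaller than the (6000, 2000) array above so the script runs fast.
X = np.random.random((600, 200))

candidates = {
    "shuffle": lambda: np.random.shuffle(X),
    "take + permutation": lambda: np.take(
        X, np.random.permutation(X.shape[0]), axis=0, out=X
    ),
    "take + argsort": lambda: np.take(
        X, np.random.rand(X.shape[0]).argsort(), axis=0, out=X
    ),
}

results = {}
for name, fn in candidates.items():
    # Best of 5 repeats, 10 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(fn, number=10, repeat=5)) / 10
    results[name] = best * 1e3
    print(f"{name}: {results[name]:.3f} ms per call")
```

All three candidates modify X in place, so each call shuffles the output of the previous one; that should not affect the timing.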


After a bit of experimentation, I found what was, in my tests, the most memory- and time-efficient way to shuffle data (row-wise) in an nD array. First shuffle the indices of the array, then use the shuffled indices to fetch the data, e.g.

rand_num2 = np.random.randint(5, size=(6000, 2000))
perm = np.arange(rand_num2.shape[0])
np.random.shuffle(perm)
rand_num2 = rand_num2[perm]
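One thing to keep in mind with this approach: indexing with an integer array builds a new array instead of reordering the existing one, so the old data is only released once the name is rebound. A small sketch that shows this (the names data and shuffled and the tiny array are mine, for illustration only):

```python
import numpy as np

data = np.arange(12).reshape(6, 2)
before = data.copy()

perm = np.arange(data.shape[0])
np.random.shuffle(perm)      # shuffle the small index array in place

shuffled = data[perm]        # integer-array indexing returns a copy

print(np.shares_memory(shuffled, data))   # False: a separate buffer
print(np.array_equal(data, before))       # True: the source is untouched
print(shuffled)
```

A side benefit is that the same perm can be reused to reorder a second array with the same number of rows in exactly the same way.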

In more detail:
Here, I am using memory_profiler to measure memory usage and Python's built-in time module to record the time, comparing all the previous answers.

import time

import numpy as np


def main():
    # Shuffle the data itself
    rand_num = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    np.random.shuffle(rand_num)
    print('Time for direct shuffle: {0}'.format((time.time() - start)))

    # Shuffle the index and get the data from the shuffled index
    rand_num2 = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    perm = np.arange(rand_num2.shape[0])
    np.random.shuffle(perm)
    rand_num2 = rand_num2[perm]
    print('Time for shuffling index: {0}'.format((time.time() - start)))

    # Using np.take()
    rand_num3 = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
    print("Time taken by np.take, {0}".format((time.time() - start)))

Result for Time

Time for direct shuffle: 0.03345608711242676   # 33.4 msec
Time for shuffling index: 0.019818782806396484 # 19.8 msec
Time taken by np.take, 0.06726956367492676     # 67.2 msec

Memory profiler Result

Line #    Mem usage    Increment   Line Contents
================================================
    39  117.422 MiB    0.000 MiB   @profile
    40                             def main():
    41                                 # shuffle data itself
    42  208.977 MiB   91.555 MiB       rand_num = np.random.randint(5, size=(6000, 2000))
    43  208.977 MiB    0.000 MiB       start = time.time()
    44  208.977 MiB    0.000 MiB       np.random.shuffle(rand_num)
    45  208.977 MiB    0.000 MiB       print('Time for direct shuffle: {0}'.format((time.time() - start)))
    46
    47                                 # Shuffle index and get data from shuffled index
    48  300.531 MiB   91.555 MiB       rand_num2 = np.random.randint(5, size=(6000, 2000))
    49  300.531 MiB    0.000 MiB       start = time.time()
    50  300.535 MiB    0.004 MiB       perm = np.arange(rand_num2.shape[0])
    51  300.539 MiB    0.004 MiB       np.random.shuffle(perm)
    52  300.539 MiB    0.000 MiB       rand_num2 = rand_num2[perm]
    53  300.539 MiB    0.000 MiB       print('Time for shuffling index: {0}'.format((time.time() - start)))
    54
    55                                 # using np.take()
    56  392.094 MiB   91.555 MiB       rand_num3 = np.random.randint(5, size=(6000, 2000))
    57  392.094 MiB    0.000 MiB       start = time.time()
    58  392.242 MiB    0.148 MiB       np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
    59  392.242 MiB    0.000 MiB       print("Time taken by np.take, {0}".format((time.time() - start)))