Numpy shuffle multidimensional array by row only, keep column order unchanged
You can use numpy.random.shuffle().
This function shuffles an array only along the first axis of a multi-dimensional array. The order of the sub-arrays changes, but their contents remain the same.
In [2]: import numpy as np

In [3]: X = np.random.random((6, 2))

In [4]: X
Out[4]:
array([[0.71935047, 0.25796155],
       [0.4621708 , 0.55140423],
       [0.22605866, 0.61581771],
       [0.47264172, 0.79307633],
       [0.22701656, 0.11927993],
       [0.20117207, 0.2754544 ]])

In [5]: np.random.shuffle(X)

In [6]: X
Out[6]:
array([[0.71935047, 0.25796155],
       [0.47264172, 0.79307633],
       [0.4621708 , 0.55140423],
       [0.22701656, 0.11927993],
       [0.20117207, 0.2754544 ],
       [0.22605866, 0.61581771]])
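For other functionality, you can also check out the following functions:
The function random.Generator.permuted was introduced in NumPy's 1.20.0 release.
The new function differs from shuffle and permutation in that the subarrays indexed by an axis are permuted rather than the axis being treated as a separate 1-D array for every combination of the other indexes. For example, it is now possible to permute the rows or columns of a 2-D array.
Note that, per its documentation, permuted shuffles each slice along the chosen axis independently of the others, so it does not keep whole rows together. To move rows as units with the Generator API, use random.Generator.shuffle or random.Generator.permutation instead.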
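To make the difference between the three Generator methods concrete, here is a minimal sketch. The array contents, the seed, and the helper name rows_intact are illustrative assumptions, not part of the original answer. Each row is built as [2k, 2k+1], so it is easy to check whether a row survived the shuffle as a unit.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed is arbitrary, only for repeatability
X = np.arange(12).reshape(6, 2)      # row k is [2k, 2k + 1]

# Generator.shuffle works in place and moves whole rows.
shuffled = X.copy()
rng.shuffle(shuffled, axis=0)

# Generator.permutation returns a shuffled copy and also moves whole rows.
permuted_copy = rng.permutation(X, axis=0)

# Generator.permuted shuffles each column independently along axis 0,
# so values from different rows can end up next to each other.
independent = rng.permuted(X, axis=0)


def rows_intact(a):
    """Hypothetical helper: True if every row is still [even, even + 1]."""
    return bool(np.all(a[:, 1] - a[:, 0] == 1))


print(rows_intact(shuffled))       # rows moved as units
print(rows_intact(permuted_copy))  # rows moved as units
print(rows_intact(independent))    # typically False: columns were shuffled separately
```

For the question asked here (shuffle rows, keep column order), shuffle or permutation with axis=0 is therefore the closer match; permuted is useful when each column should be randomized on its own.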
You can also use np.random.permutation to generate a random permutation of the row indices and then index into the rows of X using np.take with axis=0. np.take can also write the result back into the input array X itself through its out= option, which is intended to save memory. (The NumPy documentation notes that out is always buffered with the default mode='raise', so a temporary may still be allocated.) The implementation would look like this -
np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)
Sample run -
In [23]: X
Out[23]:
array([[ 0.60511059,  0.75001599],
       [ 0.30968339,  0.09162172],
       [ 0.14673218,  0.09089028],
       [ 0.31663128,  0.10000309],
       [ 0.0957233 ,  0.96210485],
       [ 0.56843186,  0.36654023]])

In [24]: np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X);

In [25]: X
Out[25]:
array([[ 0.14673218,  0.09089028],
       [ 0.31663128,  0.10000309],
       [ 0.30968339,  0.09162172],
       [ 0.56843186,  0.36654023],
       [ 0.0957233 ,  0.96210485],
       [ 0.60511059,  0.75001599]])
Additional performance boost
Here's a trick to speed up np.random.permutation(X.shape[0]) with np.argsort() -
np.random.rand(X.shape[0]).argsort()
Speedup results -
In [32]: X = np.random.random((6000, 2000))

In [33]: %timeit np.random.permutation(X.shape[0])
1000 loops, best of 3: 510 µs per loop

In [34]: %timeit np.random.rand(X.shape[0]).argsort()
1000 loops, best of 3: 297 µs per loop
Thus, the shuffling solution could be modified to -
np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)
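The argsort trick works because sorting a vector of random keys and recording where each key came from yields a random ordering of the indices 0..n-1. The sketch below checks that property on a small array; the array and variable names are illustrative assumptions, and the relative speed of the two approaches may differ on your NumPy version and hardware.

```python
import numpy as np

X = np.arange(20).reshape(10, 2)   # small illustrative array
n_rows = X.shape[0]

# Draw one random key per row; argsort returns the indices that would
# sort the keys, which is a random ordering of 0..n_rows-1.
keys = np.random.rand(n_rows)
order = keys.argsort()

# Every index appears exactly once, so `order` is a valid permutation.
is_permutation = np.array_equal(np.sort(order), np.arange(n_rows))
print(is_permutation)

# Use the permutation to reorder rows (not in place in this sketch).
shuffled = np.take(X, order, axis=0)
print(shuffled)
```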
Runtime tests -
These tests include the two approaches listed in this post and the np.random.shuffle based one in @Kasramvd's solution.
In [40]: X = np.random.random((6000, 2000))

In [41]: %timeit np.random.shuffle(X)
10 loops, best of 3: 25.2 ms per loop

In [42]: %timeit np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)
10 loops, best of 3: 53.3 ms per loop

In [43]: %timeit np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)
10 loops, best of 3: 53.2 ms per loop
So the np.take based approaches seem worth using only if memory is a concern; otherwise, the np.random.shuffle based solution looks like the way to go.
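If you want to repeat this comparison outside IPython, a rough sketch using the standard-library timeit module could look like the following. The function names, the default array shape, and the repeat counts are assumptions made for illustration; absolute numbers will vary by machine and NumPy version, so treat the output as indicative only.

```python
import timeit

import numpy as np


def shuffle_direct(X):
    # In-place shuffle along the first axis.
    np.random.shuffle(X)


def shuffle_take_permutation(X):
    # Reorder rows via np.take, writing the result back into X.
    np.take(X, np.random.permutation(X.shape[0]), axis=0, out=X)


def shuffle_take_argsort(X):
    # Same as above, but the permutation comes from argsort of random keys.
    np.take(X, np.random.rand(X.shape[0]).argsort(), axis=0, out=X)


def benchmark(shape=(600, 200), repeat=3, number=5):
    """Return the best observed seconds per call for each approach."""
    results = {}
    for fn in (shuffle_direct, shuffle_take_permutation, shuffle_take_argsort):
        X = np.random.random(shape)
        times = timeit.repeat(lambda: fn(X), repeat=repeat, number=number)
        results[fn.__name__] = min(times) / number
    return results


if __name__ == "__main__":
    for name, seconds in benchmark().items():
        print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

Passing shape=(6000, 2000) to benchmark reproduces the array size used in the runs above.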
After a bit of experimentation, I found the most memory- and time-efficient way to shuffle data row-wise in an nD array: first shuffle the indices of the array, then use the shuffled indices to get the data, e.g.
rand_num2 = np.random.randint(5, size=(6000, 2000))
perm = np.arange(rand_num2.shape[0])
np.random.shuffle(perm)
rand_num2 = rand_num2[perm]
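The same idea can be wrapped in a small helper, sketched below. The function name shuffle_rows_by_index and the sample array are hypothetical. One point worth knowing is that indexing with an index array returns a new array, so the input is left untouched unless you rebind the name as in the snippet above.

```python
import numpy as np


def shuffle_rows_by_index(a):
    """Hypothetical helper: return a row-shuffled copy of `a` and the permutation used."""
    perm = np.arange(a.shape[0])
    np.random.shuffle(perm)   # shuffle the small index vector in place
    return a[perm], perm      # integer-array indexing produces a new array


data = np.arange(12).reshape(4, 3)
shuffled, perm = shuffle_rows_by_index(data)

print(perm)      # the row order that was applied
print(shuffled)  # rows of `data` in that order
print(data)      # the original array is unchanged
```

Because only the first axis is indexed, this also works for arrays with more than two dimensions, and the returned permutation can be reused to reorder a second array (for example, labels) in the same way.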
In more detail
Here, I use memory_profiler to measure memory usage and Python's built-in time module to record time, comparing all the previous answers.
import time

import numpy as np


def main():
    # shuffle data itself
    rand_num = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    np.random.shuffle(rand_num)
    print('Time for direct shuffle: {0}'.format((time.time() - start)))

    # Shuffle index and get data from shuffled index
    rand_num2 = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    perm = np.arange(rand_num2.shape[0])
    np.random.shuffle(perm)
    rand_num2 = rand_num2[perm]
    print('Time for shuffling index: {0}'.format((time.time() - start)))

    # using np.take()
    rand_num3 = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
    print("Time taken by np.take, {0}".format((time.time() - start)))
Result for Time
Time for direct shuffle: 0.03345608711242676 # 33.4msec
Time for shuffling index: 0.019818782806396484 # 19.8msec
Time taken by np.take, 0.06726956367492676 # 67.2msec
Memory profiler Result
Line #    Mem usage    Increment   Line Contents
================================================
    39  117.422 MiB    0.000 MiB   @profile
    40                             def main():
    41                                 # shuffle data itself
    42  208.977 MiB   91.555 MiB       rand_num = np.random.randint(5, size=(6000, 2000))
    43  208.977 MiB    0.000 MiB       start = time.time()
    44  208.977 MiB    0.000 MiB       np.random.shuffle(rand_num)
    45  208.977 MiB    0.000 MiB       print('Time for direct shuffle: {0}'.format((time.time() - start)))
    46
    47                                 # Shuffle index and get data from shuffled index
    48  300.531 MiB   91.555 MiB       rand_num2 = np.random.randint(5, size=(6000, 2000))
    49  300.531 MiB    0.000 MiB       start = time.time()
    50  300.535 MiB    0.004 MiB       perm = np.arange(rand_num2.shape[0])
    51  300.539 MiB    0.004 MiB       np.random.shuffle(perm)
    52  300.539 MiB    0.000 MiB       rand_num2 = rand_num2[perm]
    53  300.539 MiB    0.000 MiB       print('Time for shuffling index: {0}'.format((time.time() - start)))
    54
    55                                 # using np.take()
    56  392.094 MiB   91.555 MiB       rand_num3 = np.random.randint(5, size=(6000, 2000))
    57  392.094 MiB    0.000 MiB       start = time.time()
    58  392.242 MiB    0.148 MiB       np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
    59  392.242 MiB    0.000 MiB       print("Time taken by np.take, {0}".format((time.time() - start)))