Shuffle DataFrame rows


The idiomatic way to do this with Pandas is to use the .sample method of your dataframe to sample all rows without replacement:

df.sample(frac=1)

The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means return all rows (in random order).
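As a small illustration of the point above, the sketch below shuffles a toy frame and checks that every row comes back exactly once. The column names and the `random_state=42` seed are assumptions made up for the example; `random_state` is only there so the order is repeatable.

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": list("vwxyz")})

# frac=1 returns every row exactly once, in random order.
# random_state fixes the seed so the same order comes back on each run.
shuffled = df.sample(frac=1, random_state=42)

# Each row keeps its original index label, so sorting the labels
# recovers the original index.
same_rows = sorted(shuffled.index) == list(df.index)

# Sampling again with the same seed gives the same order.
repeatable = shuffled.equals(df.sample(frac=1, random_state=42))

print(shuffled)
print(same_rows, repeatable)
```

Leaving out `random_state` gives a different order on each call, which is usually what you want outside of tests.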


Note: If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.:

df = df.sample(frac=1).reset_index(drop=True)

Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.
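To make the effect of drop=True concrete, here is a minimal sketch comparing the two calls on a toy frame (the column name `x` and the seed are assumptions for illustration):

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame({"x": [10, 20, 30, 40]})
shuffled = df.sample(frac=1, random_state=0)

# Without drop=True the old labels are kept as a new "index" column.
kept = shuffled.reset_index()

# With drop=True the old labels are discarded.
dropped = shuffled.reset_index(drop=True)

print(list(kept.columns))     # ['index', 'x']
print(list(dropped.columns))  # ['x']
print(list(dropped.index))    # [0, 1, 2, 3]
```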

Follow-up note: The above operation is not truly in-place: df is rebound to a new object (by which I mean id(df_old) is not the same as id(df_new)). Even so, you do not end up holding two copies of the data. Once df is rebound, nothing references the original frame any more, so its memory can be released, and memory usage after the line is essentially unchanged. Note that a line-by-line profiler reports the net change per line, not any transient peak while the line runs. To see this, you could run a simple memory profiler:

$ python3 -m memory_profiler .\test.py
Filename: .\test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)
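A complementary check, offered only as a rough sketch, is to ask NumPy whether the old and new frames share a data buffer. The names `df_old` and `df_new`, the frame shape, and the seeds are assumptions for the example. In this setup I would expect the buffers to be distinct, which suggests the flat profiler reading comes from the old frame being freed after the rebind rather than from buffer reuse.

```python
import numpy as np
import pandas as pd

# Small single-dtype frame; shape and seeds are arbitrary choices.
df_old = pd.DataFrame(np.random.default_rng(0).standard_normal((1000, 4)))
df_new = df_old.sample(frac=1, random_state=0).reset_index(drop=True)

# The shuffled result is a different Python object ...
distinct_objects = df_new is not df_old

# ... and this asks whether the two frames overlap in memory.
shares_buffer = np.shares_memory(df_old.to_numpy(), df_new.to_numpy())

print(distinct_objects, shares_buffer)
```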


You can simply use sklearn for this:

from sklearn.utils import shuffle
df = shuffle(df)


You can shuffle the rows of a dataframe by indexing with a shuffled index. For this, you can e.g. use np.random.permutation (np.random.choice with replace=False is also a possibility):

In [12]: df = pd.read_csv(StringIO(s), sep="\s+")

In [13]: df
Out[13]:
    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
20     7     8     9     2
21    10    11    12     2
45    13    14    15     3
46    16    17    18     3

In [14]: df.iloc[np.random.permutation(len(df))]
Out[14]:
    Col1  Col2  Col3  Type
46    16    17    18     3
45    13    14    15     3
20     7     8     9     2
0      1     2     3     1
1      4     5     6     1
21    10    11    12     2

If you want the shuffled rows renumbered sequentially, you can simply reset the index (note that this numbers the rows 0, 1, ..., n-1): df_shuffled.reset_index(drop=True)
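Putting the pieces of this answer together, a self-contained sketch might look like the following. It uses NumPy's seeded Generator API (np.random.default_rng) instead of the legacy np.random.permutation call shown above, purely so the example is repeatable. The reduced column set and the seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# A cut-down version of the example frame, with the same index labels.
df = pd.DataFrame(
    {"Col1": [1, 4, 7, 10, 13, 16], "Type": [1, 1, 2, 2, 3, 3]},
    index=[0, 1, 20, 21, 45, 46],
)

rng = np.random.default_rng(seed=0)  # seed chosen only for repeatability

# Option 1: index with a random permutation of the positions 0..n-1.
by_permutation = df.iloc[rng.permutation(len(df))]

# Option 2: draw all n positions without replacement.
by_choice = df.iloc[rng.choice(len(df), size=len(df), replace=False)]

# Renumber the shuffled rows 0..n-1, discarding the old labels.
df_shuffled = by_permutation.reset_index(drop=True)

print(by_permutation)
print(df_shuffled)
```

Note that replace=False matters in the second option: with the default replace=True, some rows could be drawn twice and others not at all, which is a resample rather than a shuffle.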