
Subsample pandas dataframe


You can select random elements from the index with np.random.choice. Eg to select 5 random rows:

df = pd.DataFrame(np.random.rand(10))
df.loc[np.random.choice(df.index, 5, replace=False)]

np.random.choice is new in NumPy 1.7. If you need a solution that works with an older NumPy, you can shuffle the index and take the first elements of the result:

df.loc[np.random.permutation(df.index)[:5]]

Sampling this way leaves the rows of your DataFrame in random order. If you need the original order (for example, for a line plot), you can simply sort the result by index afterwards with .sort_index() (older pandas versions also offered .sort()).
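Putting the pieces above together, here is a minimal sketch. The column name "value" and the variable names are made up for illustration; only np.random.choice, np.random.permutation, .loc and .sort_index() come from the libraries themselves.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame: 10 rows, one column of random values.
df = pd.DataFrame(np.random.rand(10), columns=["value"])

# Pick 5 distinct index labels, then select those rows.
picked = np.random.choice(df.index, 5, replace=False)
subsample = df.loc[picked]

# Same idea via a full shuffle of the index, keeping the first 5 labels.
shuffled = df.loc[np.random.permutation(df.index)[:5]]

# Restore the original row order, e.g. before drawing a line plot.
subsample_sorted = subsample.sort_index()
```

Both approaches return the rows in random order; only the sort_index() call puts them back in index order.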


Unfortunately, np.random.choice appears to be quite slow for small samples (less than 10% of all rows). You may be better off using plain ol' sample from the standard library:

from random import sample
df.loc[sample(df.index, 1000)]

For a large DataFrame (a million rows), sample is much faster when the sample is small:

In [11]: %timeit df.loc[sample(df.index, 10)]
1000 loops, best of 3: 1.19 ms per loop

In [12]: %timeit df.loc[np.random.choice(df.index, 10, replace=False)]
1 loops, best of 3: 1.36 s per loop

In [13]: %timeit df.loc[np.random.permutation(df.index)[:10]]
1 loops, best of 3: 1.38 s per loop

In [21]: %timeit df.loc[sample(df.index, 1000)]
10 loops, best of 3: 14.5 ms per loop

In [22]: %timeit df.loc[np.random.choice(df.index, 1000, replace=False)]
1 loops, best of 3: 1.28 s per loop

In [23]: %timeit df.loc[np.random.permutation(df.index)[:1000]]
1 loops, best of 3: 1.3 s per loop

But at around 10% of the rows, the timings are about the same:

In [31]: %timeit df.loc[sample(df.index, 100000)]
1 loops, best of 3: 1.63 s per loop

In [32]: %timeit df.loc[np.random.choice(df.index, 100000, replace=False)]
1 loops, best of 3: 1.36 s per loop

In [33]: %timeit df.loc[np.random.permutation(df.index)[:100000]]
1 loops, best of 3: 1.4 s per loop

and if you are sampling everything (don't use sample!):

In [41]: %timeit df.loc[sample(df.index, 1000000)]
1 loops, best of 3: 10 s per loop
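If you want to rerun this comparison on your own machine outside IPython, something like the following rough sketch with the standard timeit module should do. The helper names (with_sample, with_choice, with_permutation) and the "value" column are made up here, and the frame is kept at 100,000 rows so the script finishes quickly. Note one deliberate difference from the session above: the index is converted to a list once, outside the timed code, because random.sample insists on a real sequence on recent Python versions. Expect different absolute numbers.

```python
import timeit
from random import sample

import numpy as np
import pandas as pd

# Hypothetical frame; raise the row count to 1_000_000 to match the text.
df = pd.DataFrame(np.random.rand(100_000), columns=["value"])

# Assumption: convert once up front, since random.sample needs a sequence.
labels = list(df.index)


def with_sample(k):
    """Select k random rows using the standard library's random.sample."""
    return df.loc[sample(labels, k)]


def with_choice(k):
    """Select k random rows using np.random.choice without replacement."""
    return df.loc[np.random.choice(df.index, k, replace=False)]


def with_permutation(k):
    """Select k random rows by shuffling the whole index first."""
    return df.loc[np.random.permutation(df.index)[:k]]


for k in (10, 1000):
    for fn in (with_sample, with_choice, with_permutation):
        # Average over 3 calls; timeit returns the total time in seconds.
        seconds = timeit.timeit(lambda: fn(k), number=3) / 3
        print(f"{fn.__name__:>16}  k={k:<5} {seconds * 1e3:.2f} ms per call")
```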

Note: both numpy.random and random accept a seed, to reproduce randomly generated output.
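For instance, a small sketch of seeding both generators (the frame and the seed value 0 are arbitrary choices for illustration; the index is wrapped in list() for random.sample, which expects a sequence on recent Python versions):

```python
import random

import numpy as np
import pandas as pd

# Hypothetical frame with 100 rows.
df = pd.DataFrame({"value": range(100)})

# Seed the standard-library generator, then sample index labels twice.
random.seed(0)
first = random.sample(list(df.index), 5)
random.seed(0)
second = random.sample(list(df.index), 5)
assert first == second  # same seed, same labels

# Seed NumPy's global generator the same way.
np.random.seed(0)
a = np.random.choice(df.index, 5, replace=False)
np.random.seed(0)
b = np.random.choice(df.index, 5, replace=False)
assert (a == b).all()  # same seed, same labels

# Either list of labels can then be passed to .loc as before.
rows = df.loc[first]
```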

As @joris points out in the comments, choice (without replacement) is actually sugar for permutation, so it's no surprise that its running time stays constant regardless of the sample size, which makes it slower for smaller samples...


These days, one can simply use the sample method on a DataFrame:

>>> help(df.sample)
Help on method sample in module pandas.core.generic:

sample(self, n=None, frac=None, replace=False, weights=None, random_state=None, axis=None) method of pandas.core.frame.DataFrame instance
    Returns a random sample of items from an axis of object.

Replicability can be achieved by using the random_state keyword:

>>> len(set(df.sample(n=1, random_state=np.random.RandomState(0)).iterations.values[0] for _ in xrange(1000)))
1
>>> len(set(df.sample(n=1).iterations.values[0] for _ in xrange(1000)))
40
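As a self-contained sketch of the same idea, here is df.sample with a fixed number of rows (n) and with a fraction of the rows (frac). The frame below is made up for illustration, reusing the iterations column name from the session above. An integer seed is used for random_state here rather than a RandomState object; both are accepted.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 50 rows.
df = pd.DataFrame({"iterations": np.arange(50)})

# Five random rows, reproducible thanks to the fixed seed.
five_rows = df.sample(n=5, random_state=0)

# Ten percent of the rows (5 of 50), also reproducible.
ten_percent = df.sample(frac=0.1, random_state=0)

# Repeating the call with the same seed returns the same rows.
same_again = df.sample(n=5, random_state=0)
assert five_rows.equals(same_again)
```

Without random_state, each call draws a fresh sample, which is why the second expression in the session above yields many distinct values.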