Subsample pandas dataframe

You can select random elements from the index with np.random.choice. For example, to select 5 random rows:

df = pd.DataFrame(np.random.rand(10))
df.loc[np.random.choice(df.index, 5, replace=False)]

np.random.choice is new in NumPy 1.7. If you want a solution that works with an older NumPy, you can shuffle the index and take the first elements of that:

df.loc[np.random.permutation(df.index)[:5]]

Either way, the resulting DataFrame is no longer in sorted order. If you need sorted rows for plotting (for example, a line plot), you can simply sort afterwards with .sort() (.sort_index() in current pandas, where DataFrame.sort has been removed).
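As a rough, self-contained sketch of the steps above: draw index labels without replacement, select those rows, then restore the original order. The column name `value` and the variable names are made up for illustration, and `sort_index()` is used on the assumption that you are on a pandas version where `.sort()` is no longer available.

```python
import numpy as np
import pandas as pd

# Small example frame; the column name "value" is hypothetical.
df = pd.DataFrame({"value": np.random.rand(10)})

# Pick 5 distinct index labels, then select the matching rows.
picked = np.random.choice(df.index, 5, replace=False)
subsample = df.loc[picked]

# The rows come back in random order; sort by index to restore
# the original ordering (handy before a line plot).
subsample_sorted = subsample.sort_index()
```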

python numpy pandas subsampling

Unfortunately, np.random.choice appears to be quite slow for small samples (less than 10% of all rows). In that case you may be better off using plain ol' random.sample:

from random import sample
# On Python 3, random.sample requires a sequence: use sample(list(df.index), 1000)
df.loc[sample(df.index, 1000)]

For a large DataFrame (a million rows), the difference shows up for small samples:

In [11]: %timeit df.loc[sample(df.index, 10)]
1000 loops, best of 3: 1.19 ms per loop

In [12]: %timeit df.loc[np.random.choice(df.index, 10, replace=False)]
1 loops, best of 3: 1.36 s per loop

In [13]: %timeit df.loc[np.random.permutation(df.index)[:10]]
1 loops, best of 3: 1.38 s per loop

In [21]: %timeit df.loc[sample(df.index, 1000)]
10 loops, best of 3: 14.5 ms per loop

In [22]: %timeit df.loc[np.random.choice(df.index, 1000, replace=False)]
1 loops, best of 3: 1.28 s per loop

In [23]: %timeit df.loc[np.random.permutation(df.index)[:1000]]
1 loops, best of 3: 1.3 s per loop

But around 10% it gets about the same:

In [31]: %timeit df.loc[sample(df.index, 100000)]
1 loops, best of 3: 1.63 s per loop

In [32]: %timeit df.loc[np.random.choice(df.index, 100000, replace=False)]
1 loops, best of 3: 1.36 s per loop

In [33]: %timeit df.loc[np.random.permutation(df.index)[:100000]]
1 loops, best of 3: 1.4 s per loop

and if you are sampling everything (don't use sample!):

In [41]: %timeit df.loc[sample(df.index, 1000000)]
1 loops, best of 3: 10 s per loop

Note: both numpy.random and random accept a seed, to reproduce randomly generated output.
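To illustrate the note on seeding, here is a minimal sketch that reseeds each generator and checks that the same labels come back. The frame and variable names are invented for the example, and the index is wrapped in `list(...)` on the assumption that you are on Python 3, where `random.sample` insists on a sequence.

```python
import random

import numpy as np
import pandas as pd

# Example frame; the column name "value" is hypothetical.
df = pd.DataFrame({"value": np.arange(100)})

# Standard-library random: reseeding reproduces the same draw.
random.seed(0)
first = random.sample(list(df.index), 5)
random.seed(0)
second = random.sample(list(df.index), 5)

# numpy.random: the same idea with the global NumPy seed.
np.random.seed(0)
np_first = np.random.choice(df.index, 5, replace=False)
np.random.seed(0)
np_second = np.random.choice(df.index, 5, replace=False)
```

Note that the two modules keep separate state, so seeding one does not affect the other.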

As @joris points out in the comments, choice (without replacement) is actually sugar for permutation, so it's no surprise that it takes roughly constant time regardless of sample size and is slower for smaller samples.


These days, one can simply use the sample method on a DataFrame:

>>> help(df.sample)
Help on method sample in module pandas.core.generic:

sample(self, n=None, frac=None, replace=False, weights=None, random_state=None, axis=None) method of pandas.core.frame.DataFrame instance
    Returns a random sample of items from an axis of object.

Replicability can be achieved by using the random_state keyword:

>>> # "iterations" is a column of the example DataFrame; use xrange instead of range on Python 2
>>> len(set(df.sample(n=1, random_state=np.random.RandomState(0)).iterations.values[0] for _ in range(1000)))
1
>>> len(set(df.sample(n=1).iterations.values[0] for _ in range(1000)))
40
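For a quick feel of the method, the following sketch samples a fixed number of rows and a fraction of rows, using an integer `random_state` for repeatability. The frame and its `value` column are made up for the example and are not the DataFrame from the answer above.

```python
import numpy as np
import pandas as pd

# Example frame with 1000 rows; the column name "value" is hypothetical.
df = pd.DataFrame({"value": np.arange(1000)})

# A fixed number of rows.
five_rows = df.sample(n=5, random_state=0)

# A fraction of the rows (10% here).
ten_percent = df.sample(frac=0.1, random_state=0)

# The same random_state gives the same rows again.
same_again = df.sample(n=5, random_state=0)
```

As with the NumPy approach, the sampled rows come back in random order, so call `sort_index()` on the result if order matters.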
