
Random Sample of a subset of a dataframe in Pandas


You can use the sample method*:

    In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])

    In [12]: df.sample(2)
    Out[12]:
       A  B
    0  1  2
    2  5  6

    In [13]: df.sample(2)
    Out[13]:
       A  B
    3  7  8
    0  1  2

*Call it on one of the section DataFrames.

Note: If the sample size is larger than the size of the DataFrame, this will raise an error unless you sample with replacement.

    In [14]: df.sample(5)
    ValueError: Cannot take a larger sample than population when 'replace=False'

    In [15]: df.sample(5, replace=True)
    Out[15]:
       A  B
    0  1  2
    1  3  4
    2  5  6
    3  7  8
    1  3  4
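If you would rather not sample with replacement, one option is to cap the sample size at the number of rows. The sketch below assumes that behaviour is acceptable for your use case; `safe_sample` is a hypothetical helper name, not part of pandas. It also passes `random_state` so that repeated calls return the same rows.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])


def safe_sample(frame, n, seed=None):
    """Sample up to n rows without replacement (hypothetical helper).

    The requested size is capped at len(frame), so asking for more rows
    than exist does not raise a ValueError.
    """
    return frame.sample(n=min(n, len(frame)), random_state=seed)


# Asking for 5 rows from a 4-row frame returns all 4 rows in random order
capped = safe_sample(df, 5, seed=0)

# The same random_state gives the same sample on every call
repeatable_a = df.sample(2, random_state=42)
repeatable_b = df.sample(2, random_state=42)
```

The cap silently returns fewer rows than requested, so check `len(capped)` if the exact count matters.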


One solution is to use the choice function from numpy.

Say you want 50 entries out of 1000; you can use:

    import numpy as np

    chosen_idx = np.random.choice(1000, replace=False, size=50)
    df_trimmed = df.iloc[chosen_idx]
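A variant of the same idea, as a rough sketch: it uses NumPy's `Generator` API (`np.random.default_rng`) and takes the population size from `len(df)` instead of hard-coding 1000. The frame and its `value` column are made-up stand-ins, and the seed is an arbitrary choice for repeatability.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a 1000-row frame
df = pd.DataFrame({"value": np.arange(1000)})

# Seeded generator; the seed value 0 is arbitrary
rng = np.random.default_rng(seed=0)

# 50 distinct row positions drawn from 0 .. len(df) - 1
chosen_idx = rng.choice(len(df), size=50, replace=False)
df_trimmed = df.iloc[chosen_idx]
```

Using `len(df)` keeps the positions valid if the frame size changes.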

This does not, of course, take your block structure into account. If you want a 50-item sample from block i, for example, you can do:

    import numpy as np

    block_start_idx = 1000 * i
    chosen_idx = np.random.choice(1000, replace=False, size=50)
    df_trimmed_from_block_i = df.iloc[block_start_idx + chosen_idx]
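To repeat this for every block, the block-wise indexing can be wrapped in a function and the results concatenated. This is only a sketch: it assumes the blocks are contiguous and exactly 1000 rows each, and `BLOCK_SIZE`, `sample_block`, and the `value` column are hypothetical names used for illustration.

```python
import numpy as np
import pandas as pd

BLOCK_SIZE = 1000  # assumed block length, with blocks stored back to back


def sample_block(frame, i, size=50, rng=None):
    """Return `size` random rows from block i (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    # Offsets within the block, drawn without replacement
    offsets = rng.choice(BLOCK_SIZE, size=size, replace=False)
    # Shift the offsets to the block's position in the full frame
    return frame.iloc[BLOCK_SIZE * i + offsets]


# Hypothetical frame made of three consecutive blocks
df = pd.DataFrame({"value": np.arange(3 * BLOCK_SIZE)})

rng = np.random.default_rng(0)  # arbitrary seed for repeatability
n_blocks = len(df) // BLOCK_SIZE
all_blocks_sample = pd.concat(sample_block(df, i, rng=rng) for i in range(n_blocks))
```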


Thank you, Jeff, but I received an error:

AttributeError: Cannot access callable attribute 'sample' of 'DataFrameGroupBy' objects, try using the 'apply' method

So instead of sample = df.groupby("section").sample(50), I suggest using the command below:

df.groupby('section').apply(lambda grp: grp.sample(50))
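For what it's worth, pandas 1.1 and later do provide `sample` on groupby objects, so the error above should only appear on older versions. The sketch below assumes pandas >= 1.1 and uses made-up data with a `section` column. It also shows one way to cope with sections that have fewer rows than the requested sample: shuffle the whole frame, then keep the first rows of each group.

```python
import numpy as np
import pandas as pd

# Hypothetical data: sections "a" and "b" have 100 rows, "c" has only 20
df = pd.DataFrame({
    "section": np.repeat(["a", "b", "c"], [100, 100, 20]),
    "value": np.arange(220),
})

# Assumes pandas >= 1.1: draw exactly 10 rows from each section
per_section = df.groupby("section").sample(n=10, random_state=0)

# Up to 50 rows per section: shuffle all rows, then take the first 50
# of each group, so small sections contribute everything they have
shuffled = df.sample(frac=1, random_state=0)
capped = shuffled.groupby("section").head(50)
```

On older pandas versions the `apply` form above remains the way to go.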