Random Sample of a subset of a dataframe in Pandas
You can use the sample method*:
In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])

In [12]: df.sample(2)
Out[12]:
   A  B
0  1  2
2  5  6

In [13]: df.sample(2)
Out[13]:
   A  B
3  7  8
0  1  2
*That is, call it on one of the section DataFrames.
Note: If the sample size is larger than the DataFrame, this will raise an error unless you sample with replacement.
In [14]: df.sample(5)
ValueError: Cannot take a larger sample than population when 'replace=False'

In [15]: df.sample(5, replace=True)
Out[15]:
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
1  3  4
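If the requested sample size might exceed the number of rows, one option is to switch to sampling with replacement only when needed. The following is a minimal sketch; `safe_sample` is a hypothetical helper name, not part of pandas, and the `random_state` values are only there to make the example repeatable.

```python
import pandas as pd


def safe_sample(df, n, random_state=None):
    """Sample n rows; hypothetical helper, not a pandas API.

    Falls back to sampling with replacement only when n exceeds
    the number of rows, which would otherwise raise ValueError.
    """
    replace = n > len(df)
    return df.sample(n, replace=replace, random_state=random_state)


df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])
small = safe_sample(df, 2, random_state=0)  # 2 distinct rows
large = safe_sample(df, 5, random_state=0)  # 5 rows, so at least one repeats
```

Whether silently repeating rows is acceptable depends on what the sample is used for; capping the size with `min(n, len(df))` is the alternative if duplicates are not wanted.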
One solution is to use the choice function from numpy.
Say you want 50 entries out of 1000; you can use:
import numpy as np
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed = df.iloc[chosen_idx]
Of course, this does not take your block structure into account. If you want a 50-item sample from block i, for example, you can do:
import numpy as np
block_start_idx = 1000 * i
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed_from_block_i = df.iloc[block_start_idx + chosen_idx]
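The block-wise approach above can be wrapped in a small function. This is only a sketch: it assumes the rows are laid out in contiguous blocks of equal size, and the names `sample_block`, `block_size`, and `seed` are illustrative rather than taken from the answer. It also swaps in `np.random.default_rng`, NumPy's newer generator interface, in place of `np.random.choice`.

```python
import numpy as np
import pandas as pd


def sample_block(df, i, n, block_size=1000, seed=None):
    """Return n rows drawn without replacement from block i.

    Assumes block i occupies row positions
    [i * block_size, (i + 1) * block_size) of df.
    """
    rng = np.random.default_rng(seed)
    # Offsets within the block, each chosen at most once.
    offsets = rng.choice(block_size, size=n, replace=False)
    return df.iloc[i * block_size + offsets]


# Toy frame: 3 blocks of 1000 rows each, labelled by block number.
df = pd.DataFrame({"section": np.repeat([0, 1, 2], 1000),
                   "value": np.arange(3000)})
picked = sample_block(df, i=1, n=50, seed=0)
```

Because the offsets are positional, this only works if the frame is sorted so that each block is contiguous; otherwise a groupby-based approach is the safer choice.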
Thank you, Jeff, but I received an error:
AttributeError: Cannot access callable attribute 'sample' of 'DataFrameGroupBy' objects, try using the 'apply' method
So instead of sample = df.groupby("section").sample(50), I suggest using the command below:
df.groupby('section').apply(lambda grp: grp.sample(50))
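For a self-contained illustration of the `apply` workaround above, here is a sketch on a toy frame; the data and the `random_state` value are made up for the example. Note that pandas 1.1 and later provide `DataFrameGroupBy.sample`, so on recent versions the original `df.groupby("section").sample(50)` call should work directly.

```python
import pandas as pd

# Toy data: 3 sections with 4 rows each (made up for illustration).
df = pd.DataFrame({"section": ["a"] * 4 + ["b"] * 4 + ["c"] * 4,
                   "value": range(12)})

# Sample 2 rows from every section; group_keys=False keeps the
# original row labels instead of adding the section to the index.
sampled = df.groupby("section", group_keys=False).apply(
    lambda grp: grp.sample(2, random_state=0)
)

# Look up each sampled row's section by its original label.
per_section = df.loc[sampled.index, "section"].value_counts()
```

Depending on the pandas version, `apply` may warn about or exclude the grouping column in the result, which is why the section counts here are looked up from the original frame by row label.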