Chaining grouping, filtering, and aggregation
You can do it this way:
```
In [310]: df
Out[310]:
    a  b
0   1  4
1   7  3
2   6  9
3   4  4
4   0  2
5   8  4
6   7  7
7   0  5
8   8  5
9   8  7
10  6  1
11  3  8
12  7  4
13  8  0
14  5  3
15  5  3
16  8  1
17  7  2
18  9  9
19  3  2
20  9  1
21  1  2
22  0  3
23  8  9
24  7  7
25  8  1
26  5  8
27  9  6
28  2  8
29  9  0

In [314]: r = df.groupby('a').apply(lambda x: x.b.mean() if len(x)>=5 else -1)

In [315]: r
Out[315]:
a
0   -1.000000
1   -1.000000
2   -1.000000
3   -1.000000
4   -1.000000
5   -1.000000
6   -1.000000
7    4.600000
8    3.857143
9   -1.000000
dtype: float64

In [316]: r[r>0]
Out[316]:
a
7    4.600000
8    3.857143
dtype: float64
```
A one-liner that returns a DataFrame instead of a Series:
```python
df.groupby('a') \
  .apply(lambda x: x.b.mean() if len(x)>=5 else -1) \
  .to_frame() \
  .rename(columns={0:'mean'}) \
  .query('mean > 0')
```
A %timeit comparison against a DataFrame with 100,000 rows:
```python
def maxu():
    r = df.groupby('a').apply(lambda x: x.b.mean() if len(x)>=5 else -1)
    return r[r>0]

def maxu2():
    return df.groupby('a') \
             .apply(lambda x: x.b.mean() if len(x)>=5 else -1) \
             .to_frame() \
             .rename(columns={0:'mean'}) \
             .query('mean > 0')

def alexander():
    return df.groupby('a', as_index=False).filter(lambda group: group.a.count() >= 5).groupby('a').mean()

def alexander2():
    vc = df.a.value_counts()
    return df.loc[df.a.isin(vc[vc >= 5].index)].groupby('a').mean()
```
Results:
```
In [419]: %timeit maxu()
1 loop, best of 3: 1.12 s per loop

In [420]: %timeit maxu2()
1 loop, best of 3: 1.12 s per loop

In [421]: %timeit alexander()
1 loop, best of 3: 34.9 s per loop

In [422]: %timeit alexander2()
10 loops, best of 3: 66.6 ms per loop
```
Check:
```
In [423]: alexander2().sum()
Out[423]:
b    19220943.162
dtype: float64

In [424]: maxu2().sum()
Out[424]:
mean    19220943.162
dtype: float64
```
Conclusion:
The clear winner is the alexander2() function. @Alexander, congratulations!
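For completeness, the slow and fast approaches can be cross-checked against each other. A minimal sketch on hypothetical data (seed and sizes chosen for illustration only, not the frame from the timings above), where some groups genuinely fall below the 5-member threshold:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 40 possible keys over 200 rows, so some groups
# have fewer than 5 members and actually get filtered out.
np.random.seed(1)
df = pd.DataFrame({'a': np.random.randint(0, 40, 200),
                   'b': np.random.randint(1, 10, 200)})

# Slow approach: groupby + apply, flagging small groups with -1.
r = df.groupby('a').apply(lambda x: x.b.mean() if len(x) >= 5 else -1)
slow = r[r > 0]

# Fast approach (alexander2): pre-filter rows via value_counts, group once.
vc = df['a'].value_counts()
fast = df.loc[df['a'].isin(vc[vc >= 5].index)].groupby('a')['b'].mean()

# Both yield the same groups with the same means.
assert list(slow.sort_index().index) == list(fast.sort_index().index)
assert np.allclose(slow.sort_index().to_numpy(), fast.sort_index().to_numpy())
```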
Here is some reproducible data:
```
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 10, (10, 2)), columns=list('AB'))

>>> df
   A  B
0  5  0
1  3  3
2  7  9
3  3  5
4  2  4
5  7  6
6  8  8
7  1  6
8  7  7
9  8  1
```
A sample filter application, demonstrating that it works on this data:
```
gb = df.groupby('A')

>>> gb.filter(lambda group: group.A.count() >= 3)
   A  B
2  7  9
5  7  6
8  7  7
```
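The point of filter here is that it returns the original rows, not aggregates, for every group passing the predicate. A small sketch verifying this on the seeded frame above (where only A == 7 occurs three or more times):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 10, (10, 2)), columns=list('AB'))

gb = df.groupby('A')
kept = gb.filter(lambda group: group.A.count() >= 3)

# filter returns the surviving group's rows untouched, with their
# original index labels; here that is the A == 7 group at rows 2, 5, 8.
assert (kept['A'] == 7).all()
assert sorted(kept.index) == [2, 5, 8]
```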
Here are some of your options:
1) You can first filter based on the value counts, and then group:
```
vc = df.A.value_counts()

>>> df.loc[df.A.isin(vc[vc >= 2].index)].groupby('A').mean()
          B
A
3  4.000000
7  7.333333
8  4.500000
```
2) Perform groupby twice, before and after the filter:
```
>>> (df.groupby('A', as_index=False)
       .filter(lambda group: group.A.count() >= 2)
       .groupby('A')
       .mean())
          B
A
3  4.000000
7  7.333333
8  4.500000
```
3) Given that your first groupby returns the groups, you can also filter on those:
```
d = {k: v for k, v in df.groupby('A').groups.items() if len(v) >= 2}  # gb.groups.iteritems() for Python 2

>>> d
{3: [1, 3], 7: [2, 5, 8], 8: [6, 9]}
```
This is a bit of a hack, but should be relatively efficient as you don't need to regroup.
```
>>> # .loc replaces the deprecated .ix indexer
>>> pd.DataFrame({col: [df.loc[d[col], 'B'].mean()] for col in d}).T.rename(columns={0: 'B'})
          B
3  4.000000
7  7.333333
8  4.500000
```
Timings with 100k rows
```
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 10, (100000, 2)), columns=list('AB'))

%timeit df.groupby('A', as_index=False).filter(lambda group: group['A'].count() >= 5).groupby('A').mean()
100 loops, best of 3: 18 ms per loop

%%timeit
vc = df.A.value_counts()
df.loc[df.A.isin(vc[vc >= 2].index)].groupby('A').mean()
100 loops, best of 3: 15.7 ms per loop
```
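A related variant, not part of the original timings, builds the row mask with groupby(...).transform('size') instead of value_counts plus isin. A hedged sketch, cross-checked against the value_counts approach above:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 10, (100000, 2)), columns=list('AB'))

# Boolean mask: True for rows whose 'A' group has at least 5 members.
mask = df.groupby('A')['A'].transform('size') >= 5
result = df[mask].groupby('A').mean()

# Cross-check against the value_counts variant timed above.
vc = df.A.value_counts()
expected = df.loc[df.A.isin(vc[vc >= 5].index)].groupby('A').mean()
assert result.equals(expected)
```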