
Chaining grouping, filtration and aggregation


You can do it this way:

In [310]: df
Out[310]:
    a  b
0   1  4
1   7  3
2   6  9
3   4  4
4   0  2
5   8  4
6   7  7
7   0  5
8   8  5
9   8  7
10  6  1
11  3  8
12  7  4
13  8  0
14  5  3
15  5  3
16  8  1
17  7  2
18  9  9
19  3  2
20  9  1
21  1  2
22  0  3
23  8  9
24  7  7
25  8  1
26  5  8
27  9  6
28  2  8
29  9  0

In [314]: r = df.groupby('a').apply(lambda x: x.b.mean() if len(x)>=5 else -1)

In [315]: r
Out[315]:
a
0   -1.000000
1   -1.000000
2   -1.000000
3   -1.000000
4   -1.000000
5   -1.000000
6   -1.000000
7    4.600000
8    3.857143
9   -1.000000
dtype: float64

In [316]: r[r>0]
Out[316]:
a
7    4.600000
8    3.857143
dtype: float64

A one-liner, which returns a DataFrame instead of a Series:

df.groupby('a') \
  .apply(lambda x: x.b.mean() if len(x)>=5 else -1) \
  .to_frame() \
  .rename(columns={0:'mean'}) \
  .query('mean > 0')
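As an aside, a sketch of my own (not from the original answer): Series.to_frame() accepts a column name directly, which lets the one-liner skip the rename step:

# to_frame('mean') names the single column at creation, so no rename() is needed.
df.groupby('a') \
  .apply(lambda x: x.b.mean() if len(x)>=5 else -1) \
  .to_frame('mean') \
  .query('mean > 0')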

A timeit comparison against a DataFrame with 100,000 rows:

def maxu():
    r = df.groupby('a').apply(lambda x: x.b.mean() if len(x)>=5 else -1)
    return r[r>0]

def maxu2():
    return df.groupby('a') \
             .apply(lambda x: x.b.mean() if len(x)>=5 else -1) \
             .to_frame() \
             .rename(columns={0:'mean'}) \
             .query('mean > 0')

def alexander():
    return df.groupby('a', as_index=False).filter(lambda group: group.a.count() >= 5).groupby('a').mean()

def alexander2():
    vc = df.a.value_counts()
    return df.loc[df.a.isin(vc[vc >= 5].index)].groupby('a').mean()

Results:

In [419]: %timeit maxu()
1 loop, best of 3: 1.12 s per loop

In [420]: %timeit maxu2()
1 loop, best of 3: 1.12 s per loop

In [421]: %timeit alexander()
1 loop, best of 3: 34.9 s per loop

In [422]: %timeit alexander2()
10 loops, best of 3: 66.6 ms per loop

Check:

In [423]: alexander2().sum()
Out[423]:
b   19220943.162
dtype: float64

In [424]: maxu2().sum()
Out[424]:
mean   19220943.162
dtype: float64

Conclusion:

The clear winner is the alexander2() function.

@Alexander, congratulations!
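For completeness, here is a transform-based sketch of my own (not part of the timed comparison above). It builds the same row mask as alexander2(), staying vectorized and avoiding the per-group Python lambda that slows down the apply- and filter-based versions:

# Keep rows whose 'a' value occurs at least 5 times, then aggregate once.
mask = df.groupby('a')['a'].transform('size') >= 5
means = df[mask].groupby('a')['b'].mean()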


Here is some reproducible data:

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 10, (10, 2)), columns=list('AB'))

>>> df
   A  B
0  5  0
1  3  3
2  7  9
3  3  5
4  2  4
5  7  6
6  8  8
7  1  6
8  7  7
9  8  1

A sample filter application, demonstrating that it works on the data:

gb = df.groupby('A')

>>> gb.filter(lambda group: group.A.count() >= 3)
   A  B
2  7  9
5  7  6
8  7  7
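As a side note (standard filter() behavior, not shown in the original example), passing dropna=False keeps the frame's original shape, filling the rows of non-qualifying groups with NaN instead of dropping them:

# Rows from groups that fail the predicate become NaN rather than disappearing.
gb.filter(lambda group: group.A.count() >= 3, dropna=False)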

Here are some of your options:

1) You can first filter based on the value counts, and then group.

vc = df.A.value_counts()

>>> df.loc[df.A.isin(vc[vc >= 2].index)].groupby('A').mean()
          B
A
3  4.000000
7  7.333333
8  4.500000

2) Perform groupby twice, before and after the filter:

>>> (df.groupby('A', as_index=False)
       .filter(lambda group: group.A.count() >= 2)
       .groupby('A')
       .mean())
          B
A
3  4.000000
7  7.333333
8  4.500000
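One detail worth flagging (my observation, not from the original answer): filter() returns a subset of the original rows, so the as_index=False on the first groupby has no effect here and can be dropped:

# Produces the same table as above; filter() output is unaffected by as_index.
(df.groupby('A')
   .filter(lambda group: group.A.count() >= 2)
   .groupby('A')
   .mean())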

3) Given that your first groupby returns the groups, you can also filter on those:

d = {k: v
     for k, v in df.groupby('A').groups.items()
     if len(v) >= 2}  # gb.groups.iteritems() for Python 2

>>> d
{3: [1, 3], 7: [2, 5, 8], 8: [6, 9]}

This is a bit of a hack, but should be relatively efficient as you don't need to regroup.

>>> pd.DataFrame({col: [df.loc[d[col], 'B'].mean()] for col in d}).T.rename(columns={0: 'B'})
          B
3  4.000000
7  7.333333
8  4.500000
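The same result can also be assembled from a Series, which may read a little more naturally; a sketch of my own using the same d and df, not the original answer's code:

# Build a label -> mean mapping, then promote it to a one-column frame.
pd.Series({k: df.loc[v, 'B'].mean() for k, v in d.items()}).to_frame('B')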

Timings with 100k rows:

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 10, (100000, 2)), columns=list('AB'))

%timeit df.groupby('A', as_index=False).filter(lambda group: group['A'].count() >= 5).groupby('A').mean()
100 loops, best of 3: 18 ms per loop

%%timeit
vc = df.A.value_counts()
df.loc[df.A.isin(vc[vc >= 2].index)].groupby('A').mean()
100 loops, best of 3: 15.7 ms per loop