Create complicated conditional column (geometric mean) Python


This calculates the geometric mean of each site and checks if it is greater than 30:

>>> df['geo_mean_acceptable'] = (
...     df.groupby('Site').EnteroCount
...       .transform(lambda group: group.prod() ** (1 / float(len(group))) > 30)
...       .astype(bool))

And this gets the geometric mean of each site:

>>> df.groupby('Site').EnteroCount.apply(lambda group: group.product() ** (1 / float(len(group))))
Site
A     68.016702
B    121.981006
C    180.000000
Name: EnteroCount, dtype: float64
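The product-based formula can also be written as the exponential of the mean of the logs, which avoids building one very large product (an integer product can overflow for long series of large counts). A minimal sketch, assuming strictly positive counts; the sample DataFrame and the helper name `log_gmean` are illustrative assumptions, not part of the original data:

```python
import numpy as np
import pandas as pd

# Assumed sample data, for illustration only.
df = pd.DataFrame({
    'Site': ['A'] * 5 + ['B'] * 3 + ['C'],
    'EnteroCount': [1733, 150, 70, 20, 4, 1500, 55, 22, 180],
})

def log_gmean(values):
    """Geometric mean as exp(mean(log(x))); assumes strictly positive values."""
    return float(np.exp(np.log(values).mean()))

grouped = df.groupby('Site').EnteroCount

# Same quantity computed two ways: via logs and via the n-th root of the product.
by_logs = grouped.apply(log_gmean)
by_product = grouped.apply(lambda group: group.prod() ** (1 / len(group)))

print(by_logs.round(6))
```

Both series should agree up to floating-point rounding, so either form can feed the threshold check.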

Using the geometric mean function from scipy:

>>> from scipy.stats.mstats import gmean
>>> df.groupby('Site').EnteroCount.apply(gmean)
Site
A     68.016702
B    121.981006
C    180.000000
Name: EnteroCount, dtype: float64

Because the five largest values in a group give the highest geometric mean of any five samples from that group, you can use this:

df.groupby('Site').EnteroCount.apply(lambda group: gmean(group.nlargest(5)))

You can see how it is selecting the largest five values by group, which then get used as parameters for gmean:

>>> df.groupby('Site').EnteroCount.apply(lambda group: group.nlargest(5).values.tolist())
Site
A    [1733, 150, 70, 20, 4]
B            [1500, 55, 22]
C                     [180]
Name: EnteroCount, dtype: object
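The claim that the five largest values give the highest five-sample geometric mean can be checked by brute force on a small group. A rough sketch, assuming hypothetical counts for a single site with more than five samples; the helper `geo_mean` is also just an illustrative name:

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical counts for one site, chosen only for illustration.
counts = pd.Series([1733, 150, 70, 20, 4, 9, 310, 46])

def geo_mean(values):
    """Geometric mean via logs; assumes strictly positive values."""
    return float(np.exp(np.mean(np.log(np.asarray(values, dtype=float)))))

# Geometric mean of every possible five-sample subset (56 subsets here).
best_of_all_subsets = max(geo_mean(subset) for subset in combinations(counts, 5))

# Geometric mean of just the five largest values.
top_five = geo_mean(counts.nlargest(5))

print(np.isclose(best_of_all_subsets, top_five))
```

Since the geometric mean increases with each of its inputs, swapping any value for a larger one can only raise it, which is why `nlargest(5)` is enough and no subset search is needed in practice.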

Summary

import numpy as np

df['swim'] = np.where(
    (df.groupby('Site').EnteroCount.transform('max') > 110) |
    (df.groupby('Site').EnteroCount.transform(lambda group: gmean(group.nlargest(5))) > 30),
    'unacceptable', 'acceptable')
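To try the summary end to end, here is a self-contained sketch. Sites A to C reuse the counts shown in this answer; site D is a hypothetical low-count site added only so that both labels appear. The intermediate names `max_too_high` and `gmean_too_high` are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import gmean

# Sites A-C use the counts from this answer; site D is hypothetical.
df = pd.DataFrame({
    'Site': ['A'] * 5 + ['B'] * 3 + ['C'] + ['D'] * 3,
    'EnteroCount': [1733, 150, 70, 20, 4, 1500, 55, 22, 180, 10, 12, 8],
})

grouped = df.groupby('Site').EnteroCount

# Rule 1: any single sample above 110 makes the site unacceptable.
max_too_high = grouped.transform('max') > 110

# Rule 2: geometric mean of the five largest samples above 30.
gmean_too_high = grouped.transform(lambda group: gmean(group.nlargest(5))) > 30

df['swim'] = np.where(max_too_high | gmean_too_high, 'unacceptable', 'acceptable')

# One label per site (every row within a site carries the same label).
print(df.groupby('Site').swim.first())
```

Using `transform` rather than `apply` keeps each result aligned with the original rows, so the two boolean masks can be combined directly with `|`.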