Create complicated conditional column (geometric mean) Python
This calculates the geometric mean of EnteroCount for each site and flags whether it is greater than 30 (True means the threshold is exceeded):
>>> df['geo_mean_acceptable'] = (
...     df.groupby('Site').EnteroCount
...     .transform(lambda group: group.prod() ** (1 / float(len(group))) > 30)
...     .astype(bool))
And this gets the geometric mean of each site:
>>> df.groupby('Site').EnteroCount.apply(lambda group: group.product() ** (1 / float(len(group))))
Site
A     68.016702
B    121.981006
C    180.000000
Name: EnteroCount, dtype: float64
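The product-based formula above can overflow for large groups of integer counts. As a minimal sketch of an alternative, the geometric mean can also be computed as the exponential of the mean of the logs. The helper name `log_gmean` is hypothetical, the sample frame is reconstructed from the group values printed in this answer, and all counts are assumed to be strictly positive:

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the per-site values shown in this answer.
df = pd.DataFrame({
    'Site': ['A'] * 5 + ['B'] * 3 + ['C'],
    'EnteroCount': [1733, 150, 70, 20, 4, 1500, 55, 22, 180],
})


def log_gmean(values):
    """Geometric mean as exp(mean(log(x))); assumes every value is > 0."""
    return float(np.exp(np.log(values.astype(float)).mean()))


# One geometric mean per site, indexed by Site.
geo_means = df.groupby('Site').EnteroCount.apply(log_gmean)
print(geo_means)
```

This should agree with the product-based result up to floating-point rounding, while staying within range for long series.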
Using the geometric mean function from scipy:
>>> from scipy.stats.mstats import gmean
>>> df.groupby('Site').EnteroCount.apply(gmean)
Site
A     68.016702
B    121.981006
C    180.000000
Name: EnteroCount, dtype: float64
Given that the five highest values will give you the highest geometric mean in a group, you can use this:
df.groupby('Site').EnteroCount.apply(lambda group: gmean(group.nlargest(5)))
You can see how it selects the five largest values in each group (or all of them when a group has fewer than five), which are then passed to gmean:
>>> df.groupby('Site').EnteroCount.apply(lambda group: group.nlargest(5).values.tolist())
Site
A    [1733, 150, 70, 20, 4]
B            [1500, 55, 22]
C                     [180]
Name: EnteroCount, dtype: object
Summary
import numpy as np

df['swim'] = np.where(
    (df.groupby('Site').EnteroCount.transform('max') > 110)
    | (df.groupby('Site').EnteroCount.transform(lambda group: gmean(group.nlargest(5))) > 30),
    'unacceptable',
    'acceptable')
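For reference, here is a self-contained sketch of the summary rule that can be run as-is. The values for sites A, B and C are reconstructed from the output shown earlier; site D is a hypothetical addition, included only so that one site passes both checks. It imports `gmean` from `scipy.stats` rather than `scipy.stats.mstats`, which is an assumption about the installed SciPy version and should be interchangeable for plain numeric data.

```python
import numpy as np
import pandas as pd
from scipy.stats import gmean

# A, B, C come from the output shown in this answer; D is hypothetical.
df = pd.DataFrame({
    'Site': ['A'] * 5 + ['B'] * 3 + ['C'] + ['D'] * 3,
    'EnteroCount': [1733, 150, 70, 20, 4, 1500, 55, 22, 180, 10, 20, 25],
})

by_site = df.groupby('Site').EnteroCount

# Broadcast each site's maximum back onto its rows.
site_max = by_site.transform('max')

# Broadcast the geometric mean of each site's five largest values.
site_gmean = by_site.transform(
    lambda group: gmean(group.nlargest(5).to_numpy()))

# A site is unacceptable if either threshold is exceeded.
df['swim'] = np.where(
    (site_max > 110) | (site_gmean > 30),
    'unacceptable',
    'acceptable')

print(df.drop_duplicates('Site')[['Site', 'swim']])
```

Computing the two per-site series once and reusing them keeps the `np.where` condition readable and avoids grouping the frame twice inside one expression.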