Drop some Pandas dataframe rows using group based condition
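In pandas 0.13 and later you can use cumcount (the session below uses the old df.sort, which later releases replaced with df.sort_values):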
    In [11]: df[df.sort('C').groupby('A').cumcount(ascending=False) >= 2]  # use .sort_index() to remove UserWarning
    Out[11]:
         A         C         D
    0  foo -0.536732  0.061055
    4  foo -0.910537 -1.634047
    5  bar -0.346749 -0.127740
    7  foo -0.640706  2.635910

    [4 rows x 3 columns]
It may make more sense to sort first:
    In [21]: df = df.sort('C')

    In [22]: df[df.groupby('A').cumcount(ascending=False) >= 2]
    Out[22]:
         A         C         D
    4  foo -0.910537 -1.634047
    7  foo -0.640706  2.635910
    0  foo -0.536732  0.061055
    5  bar -0.346749 -0.127740

    [4 rows x 3 columns]
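For anyone on a current pandas, where DataFrame.sort is gone, here is a minimal sketch of the same cumcount idea. It assumes a recent release with sort_values, rebuilds the sample frame inline, and the names ordered, rank_from_top and result are only illustrative:

```python
import pandas as pd

# Sample data copied from the frame used in this thread.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "C": [-0.536732, 1.470956, 1.981810, -0.072829,
          -0.910537, -0.346749, 0.959957, -0.640706],
    "D": [0.061055, 1.350996, 0.676978, 0.417285,
          -1.634047, -0.127740, -1.068385, 2.635910],
})

# Sort by C so that, within each group, the largest values come last.
ordered = df.sort_values("C")

# cumcount(ascending=False) numbers each group's rows from len(group) - 1
# down to 0, so the two largest C values in every group get 1 and 0.
rank_from_top = ordered.groupby("A").cumcount(ascending=False)

# Keep everything except the two largest C values per group.
result = ordered[rank_from_top >= 2]
print(result)  # rows 4, 7, 0 and 5, in that order
```

Because the mask is built from the already sorted frame, its index lines up with the frame being filtered and no reindexing warning should be raised.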
You can use the apply() method:
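    import io
    import pandas as pd

    txt = """   A         C         D
    0  foo -0.536732  0.061055
    1  bar  1.470956  1.350996
    2  foo  1.981810  0.676978
    3  bar -0.072829  0.417285
    4  foo -0.910537 -1.634047
    5  bar -0.346749 -0.127740
    6  foo  0.959957 -1.068385
    7  foo -0.640706  2.635910"""

    # Python 3: txt is a str, so wrap it in StringIO (the original used BytesIO).
    # sep=r"\s+" is the current spelling of delim_whitespace=True.
    df = pd.read_csv(io.StringIO(txt), sep=r"\s+", index_col=0)

    def f(group):
        # Sort the group by C and drop its two largest values.
        # sort_values replaces the old DataFrame.sort.
        return group.sort_values("C").iloc[:-2]

    # Newer pandas releases may warn that apply() operates on the grouping column A.
    df2 = df.groupby("A", group_keys=False).apply(f)
    print(df2)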
output:
         A         C         D
    5  bar -0.346749 -0.127740
    4  foo -0.910537 -1.634047
    7  foo -0.640706  2.635910
    0  foo -0.536732  0.061055
If you want the original order:

    # Keep only the labels that survive in df2, in the order they appear in df.
    print(df2.reindex(df.index[df.index.isin(df2.index)]))
output:
         A         C         D
    0  foo -0.536732  0.061055
    4  foo -0.910537 -1.634047
    5  bar -0.346749 -0.127740
    7  foo -0.640706  2.635910
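The isin trick above also works without apply(). As a rough sketch, assuming a recent pandas and using the illustrative names kept and restored, you can filter the original frame by the surviving index labels so its row order is preserved:

```python
import pandas as pd

# Sample data copied from the frame used in this thread.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "C": [-0.536732, 1.470956, 1.981810, -0.072829,
          -0.910537, -0.346749, 0.959957, -0.640706],
    "D": [0.061055, 1.350996, 0.676978, 0.417285,
          -1.634047, -0.127740, -1.068385, 2.635910],
})

# Rows left after dropping each group's two largest C values (sorted by C).
ordered = df.sort_values("C")
kept = ordered[ordered.groupby("A").cumcount(ascending=False) >= 2]

# Build a boolean mask over the original index; filtering df with it
# returns the surviving rows in df's own order.
restored = df[df.index.isin(kept.index)]
print(restored)  # rows 0, 4, 5 and 7
```

When the original index is already sorted, as it is here, kept.sort_index() gives the same rows; the isin mask is the safer choice when it is not.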
To get the rows above the group mean:

    def f(group):
        # Keep the rows whose C is greater than this group's mean of C.
        return group[group.C > group.C.mean()]

    df3 = df.groupby("A", group_keys=False).apply(f)
    print(df3)
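A possible alternative that avoids apply() is to broadcast each group's mean back onto the rows with transform and compare directly. This is only a sketch, assuming a recent pandas; group_mean and above are illustrative names:

```python
import pandas as pd

# Sample data copied from the frame used in this thread.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "C": [-0.536732, 1.470956, 1.981810, -0.072829,
          -0.910537, -0.346749, 0.959957, -0.640706],
    "D": [0.061055, 1.350996, 0.676978, 0.417285,
          -1.634047, -0.127740, -1.068385, 2.635910],
})

# transform("mean") returns a Series aligned with df, holding for every row
# the mean of C within that row's group.
group_mean = df.groupby("A")["C"].transform("mean")

# Keep the rows whose C exceeds their own group's mean.
above = df[df["C"] > group_mean]
print(above)  # rows 1 (bar), 2 and 6 (foo)
```

Unlike the apply() version, the result keeps the original row order rather than being grouped by A.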