Drop some Pandas dataframe rows using group based condition


In pandas 0.13 you can use cumcount:

In [11]: df[df.sort('C').groupby('A').cumcount(ascending=False) >= 2]  # use .sort_index() to remove UserWarning
Out[11]:
     A         C         D
0  foo -0.536732  0.061055
4  foo -0.910537 -1.634047
5  bar -0.346749 -0.127740
7  foo -0.640706  2.635910

[4 rows x 3 columns]

It may make more sense to sort first:

In [21]: df = df.sort('C')

In [22]: df[df.groupby('A').cumcount(ascending=False) >= 2]
Out[22]:
     A         C         D
4  foo -0.910537 -1.634047
7  foo -0.640706  2.635910
0  foo -0.536732  0.061055
5  bar -0.346749 -0.127740

[4 rows x 3 columns]
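For what it's worth, `DataFrame.sort` no longer exists in current pandas releases; `sort_values` is the replacement. Here is a minimal sketch of the same cumcount idea under that assumption, with the sample frame from this thread rebuilt by hand. The names `by_c`, `keep` and `result` are only for illustration.

```python
import pandas as pd

# Sample data copied from this thread.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "C": [-0.536732, 1.470956, 1.981810, -0.072829,
          -0.910537, -0.346749, 0.959957, -0.640706],
    "D": [0.061055, 1.350996, 0.676978, 0.417285,
          -1.634047, -0.127740, -1.068385, 2.635910],
})

# Sort by C so that the last rows of each group hold its largest C values.
by_c = df.sort_values("C")

# cumcount(ascending=False) numbers each group's rows from the end (last row = 0),
# so ">= 2" is False for the two largest C values in every group.
keep = by_c.groupby("A").cumcount(ascending=False) >= 2

# sort_index() puts the mask back in the same order as df before filtering.
result = df[keep.sort_index()]
print(result)
```

Calling `sort_index()` on the mask lines it up with `df` again, which is what the comment about the UserWarning in the session above is getting at.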


You can use the apply() method:

import pandas as pd
import io

txt="""     A         C         D
0  foo -0.536732  0.061055
1  bar  1.470956  1.350996
2  foo  1.981810  0.676978
3  bar -0.072829  0.417285
4  foo -0.910537 -1.634047
5  bar -0.346749 -0.127740
6  foo  0.959957 -1.068385
7  foo -0.640706  2.635910"""

df = pd.read_csv(io.BytesIO(txt), delim_whitespace=True, index_col=0)

def f(df):
    return df.sort("C").iloc[:-2]

df2 = df.groupby("A", group_keys=False).apply(f)
print df2

output:

     A         C         D
5  bar -0.346749 -0.127740
4  foo -0.910537 -1.634047
7  foo -0.640706  2.635910
0  foo -0.536732  0.061055
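Here is a sketch of the same per-group trimming for Python 3 and current pandas, assuming `sort_values` in place of `sort`. It swaps `apply` for an explicit loop over the groups plus `pd.concat`. That is a different technique from the one above, chosen here because how `apply` handles the grouping column has changed between pandas releases, but it should select the same rows. `drop_top2` is a made-up helper name.

```python
import pandas as pd

# Sample data copied from this thread.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "C": [-0.536732, 1.470956, 1.981810, -0.072829,
          -0.910537, -0.346749, 0.959957, -0.640706],
    "D": [0.061055, 1.350996, 0.676978, 0.417285,
          -1.634047, -0.127740, -1.068385, 2.635910],
})

def drop_top2(group):
    # Sort one group by C and drop its last two rows (the two largest C values).
    return group.sort_values("C").iloc[:-2]

# Iterating over a groupby yields (key, sub-frame) pairs in sorted key order.
df2 = pd.concat([drop_top2(group) for _, group in df.groupby("A")])
print(df2)
```

The rows come back grouped by key ("bar" first, then "foo"), each group sorted by C, as in the output above.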

If you want the original order:

print df2.reindex(df.index[df.index.isin(df2.index)])

output:

     A         C         D
0  foo -0.536732  0.061055
4  foo -0.910537 -1.634047
5  bar -0.346749 -0.127740
7  foo -0.640706  2.635910
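To see what the reindex step does on its own, here is a small Python 3 sketch. To keep it short, `df2` is built directly from the row labels shown in the grouped output (5, 4, 7, 0) rather than by re-running the groupby; `surviving` and `restored` are illustrative names.

```python
import pandas as pd

# Sample data copied from this thread.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "C": [-0.536732, 1.470956, 1.981810, -0.072829,
          -0.910537, -0.346749, 0.959957, -0.640706],
    "D": [0.061055, 1.350996, 0.676978, 0.417285,
          -1.634047, -0.127740, -1.068385, 2.635910],
})

# Rows kept by the per-group step, in the group-wise order of the earlier output.
df2 = df.loc[[5, 4, 7, 0]]

# Labels of df that survived, listed in df's own order.
surviving = df.index[df.index.isin(df2.index)]

# Reindexing df2 by those labels puts its rows back in the original order.
restored = df2.reindex(surviving)
print(restored)
```

Because the original index here happens to be sorted already, `df2.sort_index()` would give the same frame; the `reindex` form also works when the original index is not sorted.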

To get the rows above the group mean:

def f(df):
    return df[df.C > df.C.mean()]

df3 = df.groupby("A", group_keys=False).apply(f)
print df3
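As an alternative sketch for Python 3 and current pandas, `transform("mean")` can stand in for `apply` here. This is a swapped-in technique, not the code above: it broadcasts each group's mean back onto the rows, so one boolean mask does the filtering and the original row order is kept. `group_mean` is an illustrative name.

```python
import pandas as pd

# Sample data copied from this thread.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "C": [-0.536732, 1.470956, 1.981810, -0.072829,
          -0.910537, -0.346749, 0.959957, -0.640706],
    "D": [0.061055, 1.350996, 0.676978, 0.417285,
          -1.634047, -0.127740, -1.068385, 2.635910],
})

# One value per row: the mean of C within that row's group.
group_mean = df.groupby("A")["C"].transform("mean")

# Keep only the rows whose C is above their own group's mean.
df3 = df[df["C"] > group_mean]
print(df3)
```

With the sample data this keeps rows 1, 2 and 6: the "bar" mean is about 0.35 and the "foo" mean is about 0.17.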