How to remove duplicates from a dataframe? (pandas)


I think you need groupby with sort_values, and then drop_duplicates with the parameter keep='first':

print(df)
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                    NaN
3        a      1        x  2008                      1
4        a      1        x  2008                    NaN

# Sort each group by the indicator; sort_values puts NaN last by default,
# so a non-null indicator moves to the top of its group.
df = (df.groupby(['IDnumber', 'Subid', 'Subsubid', 'Date'])
        .apply(lambda x: x.sort_values('Originaldataindicator'))
        .reset_index(drop=True))
print(df)
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                      1
3        a      1        x  2008                    NaN
4        a      1        x  2008                    NaN

# Keep only the first row of each group of duplicated keys.
df1 = df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'], keep='first')
print(df1)
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                      1

Or use inplace:

df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'], keep='first', inplace=True)
print(df)
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                      1
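For reference, the sort-then-drop idea above can be sketched as one self-contained script. This is only a sketch under assumptions: the sample frame is rebuilt by hand from the printed output, the names `keys` and `deduped` are made up for illustration, and a single sort_values over the key columns plus the indicator stands in for the groupby/apply step (this reorders the whole frame by the keys, which may or may not be acceptable for your data, and it sidesteps groupby.apply, whose handling of the grouping columns differs between pandas versions).

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the frame printed above.
df = pd.DataFrame({
    'IDnumber': ['a'] * 5,
    'Subid': [1] * 5,
    'Subsubid': ['x'] * 5,
    'Date': [2006, 2007, 2008, 2008, 2008],
    'Originaldataindicator': [np.nan, np.nan, np.nan, 1, np.nan],
})

# Columns that identify a duplicate (hypothetical helper name).
keys = ['IDnumber', 'Subid', 'Subsubid', 'Date']

# Sort by the keys and then by the indicator. sort_values places NaN last
# by default (na_position='last'), so a non-null indicator ends up first
# within each run of duplicated keys.
deduped = (
    df.sort_values(keys + ['Originaldataindicator'])
      .drop_duplicates(subset=keys, keep='first')
      .reset_index(drop=True)
)
print(deduped)
```

Because drop_duplicates returns a new frame here, the original df is left untouched, which is usually easier to debug than inplace=True.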

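If the column Originaldataindicator has several non-null values per group and you want to keep all of them, use duplicated (you may want to pass all the key columns IDnumber, Subid, Subsubid, Date as the subset instead of only Date) together with isnull: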

print(df)
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                    NaN
3        a      1        x  2008                      1
4        a      1        x  2008                      1

# Drop rows that are duplicated on Date AND have a null indicator.
print(df[~((df.duplicated('Date', keep=False)) & ~(pd.notnull(df['Originaldataindicator'])))])
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
3        a      1        x  2008                      1
4        a      1        x  2008                      1

Explanation of the conditions:

print(df.duplicated('Date', keep=False))
0    False
1    False
2     True
3     True
4     True
dtype: bool

print(pd.isnull(df['Originaldataindicator']))
0     True
1     True
2     True
3    False
4    False
Name: Originaldataindicator, dtype: bool

print(~((df.duplicated('Date', keep=False)) & (pd.isnull(df['Originaldataindicator']))))
0     True
1     True
2    False
3     True
4     True
dtype: bool
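The mask above can also be built step by step over all the key columns rather than Date alone. The following is a minimal sketch, assuming the same sample data as printed above; the names `keys`, `is_dup`, `is_missing` and `result` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the frame printed above.
df = pd.DataFrame({
    'IDnumber': ['a'] * 5,
    'Subid': [1] * 5,
    'Subsubid': ['x'] * 5,
    'Date': [2006, 2007, 2008, 2008, 2008],
    'Originaldataindicator': [np.nan, np.nan, np.nan, 1, 1],
})

# Columns that identify a duplicate (hypothetical helper name).
keys = ['IDnumber', 'Subid', 'Subsubid', 'Date']

# True for every row whose key combination occurs more than once.
is_dup = df.duplicated(subset=keys, keep=False)

# True for every row whose indicator is missing.
is_missing = df['Originaldataindicator'].isnull()

# Remove rows that are both duplicated and missing the indicator.
result = df[~(is_dup & is_missing)]
print(result)
```

One thing to watch for: if every row in a duplicated group has a null indicator, this mask removes the whole group, so it fits best when each duplicated group is expected to contain at least one non-null row.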


Consider this:

df = pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, None, 1, None]})

Then

>>> df.sort_values(by=['a', 'b']).groupby(df.a).first()[['b']].reset_index()
    a   b
0   1   1
1   2   2
2   3   1

This sorts the rows first by a, then by b (which pushes the None values to the end of each group), and then selects the first item per group.
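As a rough check, the same toy frame can be run end to end and compared with a sort followed by drop_duplicates. This is a sketch only: the names `by_group` and `by_drop` are made up for illustration, and the drop_duplicates variant is an alternative shown for comparison rather than part of the approach above.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, None, 1, None]})

# Sort by a, then b (missing values go last), and take the first value of b
# in each group of a.
by_group = (
    df.sort_values(by=['a', 'b'])
      .groupby('a')
      .first()[['b']]
      .reset_index()
)

# Alternative for comparison: after the same sort, keep the first row
# for each value of a.
by_drop = (
    df.sort_values(by=['a', 'b'])
      .drop_duplicates(subset='a', keep='first')
      .reset_index(drop=True)
)

print(by_group)
print(by_drop)
```

On this data both variants give one row per value of a, with b equal to 1, 2 and 1. The drop_duplicates form keeps whole rows, which may be more convenient when the frame has further columns that should travel with the selected row.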

I believe you can modify this to the specifics of your problem.