How to remove duplicates from a dataframe?
I think you need groupby and sort_values, and then the parameter keep='first' of drop_duplicates:
    print(df)
      IDnumber  Subid Subsubid  Date  Originaldataindicator
    0        a      1        x  2006                    NaN
    1        a      1        x  2007                    NaN
    2        a      1        x  2008                    NaN
    3        a      1        x  2008                      1
    4        a      1        x  2008                    NaN

    # Sort inside each group so that NaN values end up last
    # (the parentheses allow the chained call to span several lines)
    df = (df.groupby(['IDnumber', 'Subid', 'Subsubid', 'Date'])
            .apply(lambda x: x.sort_values('Originaldataindicator'))
            .reset_index(drop=True))
    print(df)
      IDnumber  Subid Subsubid  Date  Originaldataindicator
    0        a      1        x  2006                    NaN
    1        a      1        x  2007                    NaN
    2        a      1        x  2008                      1
    3        a      1        x  2008                    NaN
    4        a      1        x  2008                    NaN

    # Keep only the first row of each key combination
    df1 = df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'], keep='first')
    print(df1)
      IDnumber  Subid Subsubid  Date  Originaldataindicator
    0        a      1        x  2006                    NaN
    1        a      1        x  2007                    NaN
    2        a      1        x  2008                      1
Or use inplace=True:
    # Modifies df directly instead of returning a new DataFrame
    df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'], keep='first', inplace=True)
    print(df)
      IDnumber  Subid Subsubid  Date  Originaldataindicator
    0        a      1        x  2006                    NaN
    1        a      1        x  2007                    NaN
    2        a      1        x  2008                      1
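For reference, the sort-then-drop idea above can also be written without the groupby/apply step. This is only a minimal sketch: the sample frame is rebuilt by hand from the printed output, and the names keys and deduped are my own, not from the question.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the printed frame (an assumption)
df = pd.DataFrame({
    'IDnumber': ['a'] * 5,
    'Subid': [1] * 5,
    'Subsubid': ['x'] * 5,
    'Date': [2006, 2007, 2008, 2008, 2008],
    'Originaldataindicator': [np.nan, np.nan, np.nan, 1, np.nan],
})

# Hypothetical helper name: the columns that define a duplicate
keys = ['IDnumber', 'Subid', 'Subsubid', 'Date']

# Sort so that, within each key combination, non-null indicators come first
# (na_position='last' is the default; it is spelled out here for clarity),
# then keep the first row of each key combination.
deduped = (
    df.sort_values(keys + ['Originaldataindicator'], na_position='last')
      .drop_duplicates(subset=keys, keep='first')
      .reset_index(drop=True)
)
print(deduped)
```

Sorting on the key columns plus the indicator should give the same ordering inside each group as the groupby version, so keep='first' still picks the non-null row where one exists.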
If the column Originaldataindicator can hold several non-null values per group, use duplicated (you may need to pass all of the columns IDnumber, Subid, Subsubid, Date as the subset) together with isnull:
    print(df)
      IDnumber  Subid Subsubid  Date  Originaldataindicator
    0        a      1        x  2006                    NaN
    1        a      1        x  2007                    NaN
    2        a      1        x  2008                    NaN
    3        a      1        x  2008                      1
    4        a      1        x  2008                      1

    # Drop rows that are duplicated by Date and have no indicator value
    print(df[~((df.duplicated('Date', keep=False)) & ~(pd.notnull(df['Originaldataindicator'])))])
      IDnumber  Subid Subsubid  Date  Originaldataindicator
    0        a      1        x  2006                    NaN
    1        a      1        x  2007                    NaN
    3        a      1        x  2008                      1
    4        a      1        x  2008                      1
Explaining conditions:
    print(df.duplicated('Date', keep=False))
    0    False
    1    False
    2     True
    3     True
    4     True
    dtype: bool

    print(pd.isnull(df['Originaldataindicator']))
    0     True
    1     True
    2     True
    3    False
    4    False
    Name: Originaldataindicator, dtype: bool

    print(~((df.duplicated('Date', keep=False)) & (pd.isnull(df['Originaldataindicator']))))
    0     True
    1     True
    2    False
    3     True
    4     True
    dtype: bool
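The same mask can be built on all four key columns instead of Date alone, which is probably closer to what the question needs. A rough sketch, assuming the sample data shown above; the variable names keys, in_duplicated_group, indicator_missing and filtered are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the printed frame (an assumption)
df = pd.DataFrame({
    'IDnumber': ['a'] * 5,
    'Subid': [1] * 5,
    'Subsubid': ['x'] * 5,
    'Date': [2006, 2007, 2008, 2008, 2008],
    'Originaldataindicator': [np.nan, np.nan, np.nan, 1, 1],
})

keys = ['IDnumber', 'Subid', 'Subsubid', 'Date']

# True for every row whose key combination occurs more than once
in_duplicated_group = df.duplicated(subset=keys, keep=False)

# True for every row without an indicator value
indicator_missing = df['Originaldataindicator'].isnull()

# Keep a row unless it is both part of a duplicated group and missing the indicator
filtered = df[~(in_duplicated_group & indicator_missing)]
print(filtered)
```

One caveat with this mask: if every row of a duplicated group has NaN in Originaldataindicator, all rows of that group are dropped, so it may need an extra condition if such groups can occur in your data.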
Consider this:
    df = pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, None, 1, None]})
Then
    >>> df.sort_values(by=['a', 'b']).groupby(df.a).first()[['b']].reset_index()
       a  b
    0  1  1
    1  2  2
    2  3  1
This sorts the items first by a, then by b (thus pushing the None values in each group last), then selects the first item per group.
I believe you can modify this to the specifics of your problem.
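As a rough illustration of that adaptation, here is how the same pattern might look with the column names from the question. The sample data and the names keys and result are assumptions on my part, so treat it as a sketch rather than a drop-in solution.

```python
import numpy as np
import pandas as pd

# Sample data modelled on the question (an assumption)
df = pd.DataFrame({
    'IDnumber': ['a'] * 5,
    'Subid': [1] * 5,
    'Subsubid': ['x'] * 5,
    'Date': [2006, 2007, 2008, 2008, 2008],
    'Originaldataindicator': [np.nan, np.nan, np.nan, 1, np.nan],
})

keys = ['IDnumber', 'Subid', 'Subsubid', 'Date']

# Sort by the keys and then by the indicator (NaN sorts last by default),
# then take the first entry per key combination.
# as_index=False keeps the key columns as ordinary columns in the result.
result = (
    df.sort_values(keys + ['Originaldataindicator'])
      .groupby(keys, as_index=False)
      .first()
)
print(result)
```

As far as I know, first() on a groupby already returns the first non-null value in each column, so the sort mainly matters when a group holds several different non-null values and you care which one wins.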