How do I get a list of all the duplicate items using pandas in Python?


Method #1: print all rows where the ID is one of the IDs in duplicated:

>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> # DataFrame.sort was removed in later pandas releases; sort_values replaces it
>>> df[ids.isin(ids[ids.duplicated()])].sort_values("ID")
       ID ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR FIRST_VISIT_DATE
24  11795       27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH        19-Jun-12
6   11795        3-Jul-12  0649597-White River VT  0649597-White River VT        30-Mar-12
18   8096       19-Dec-11  0649597-White River VT  0649597-White River VT         9-Apr-12
2    8096        8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH        25-Jun-12
12   A036       30-Nov-11     063B208-Randolph VT     063B208-Randolph VT              NaN
3    A036        1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH         9-Aug-12
26   A036       11-Aug-12      06D3206-Hanover NH                     NaN        19-Jun-12

but I couldn't think of a nice way to prevent repeating ids so many times. I prefer method #2: groupby on the ID.

>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
       ID ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR FIRST_VISIT_DATE
6   11795        3-Jul-12  0649597-White River VT  0649597-White River VT        30-Mar-12
24  11795       27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH        19-Jun-12
2    8096        8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH        25-Jun-12
18   8096       19-Dec-11  0649597-White River VT  0649597-White River VT         9-Apr-12
3    A036        1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH         9-Aug-12
12   A036       30-Nov-11     063B208-Randolph VT     063B208-Randolph VT              NaN
26   A036       11-Aug-12      06D3206-Hanover NH                     NaN        19-Jun-12
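If you'd rather not build the result with a generator, groupby(...).filter is one alternative that should select the same rows. This is only a sketch: the small frame below is a made-up stand-in for dup.csv (the VISIT column and the B001 row are invented for illustration), and unlike the concat version, filter leaves the rows in their original order instead of grouping them by ID.

```python
import pandas as pd

# Hypothetical stand-in for dup.csv; only the ID column matters here.
df = pd.DataFrame({
    "ID": ["11795", "8096", "A036", "B001", "11795", "A036", "8096", "A036"],
    "VISIT": range(8),
})

# Method #2 from the answer: concatenate every group that has more than one row.
by_concat = pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)

# Alternative: keep only the groups that pass the predicate.
# Rows come back in their original order, not grouped by ID.
by_filter = df.groupby("ID").filter(lambda g: len(g) > 1)

print(by_concat)
print(by_filter)
```

Both results contain the same rows; add .sort_values("ID") to the filter version if you want duplicates next to each other.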


As of pandas 0.17, you can pass keep=False to the duplicated method to mark every occurrence of a duplicated item, not just the repeats.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(['a','b','c','d','a','b'])

In [3]: df
Out[3]:
   0
0  a
1  b
2  c
3  d
4  a
5  b

In [4]: df[df.duplicated(keep=False)]
Out[4]:
   0
0  a
1  b
4  a
5  b


df[df.duplicated(['ID'], keep=False)]

This returns every row whose ID occurs more than once, including the first occurrence.
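Here is a small self-contained sketch of that one-liner, in case it helps to see it run. The data is invented (the TRAINER column and its values are placeholders, not the asker's file), and the sort_values call is an optional extra to put matching IDs next to each other.

```python
import pandas as pd

# Made-up sample data; only "ID" is used to detect duplicates.
df = pd.DataFrame({
    "ID": ["11795", "8096", "A036", "B001", "11795", "A036"],
    "TRAINER": ["Hanover", "Hanover", "Randolph", "Lebanon", "White River", "Hanover"],
})

# keep=False flags every row whose ID appears more than once.
mask = df.duplicated(subset=["ID"], keep=False)

# A stable sort groups equal IDs together while preserving their original order.
dupes = df[mask].sort_values("ID", kind="mergesort")

print(dupes)
```

Passing subset=["ID"] means the other columns are ignored when deciding what counts as a duplicate; without it, only rows that match in every column are flagged.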

According to documentation:

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Mark duplicates as True except for the first occurrence.
  • last : Mark duplicates as True except for the last occurrence.
  • False : Mark all duplicates as True.
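To see the three settings side by side, here is a minimal sketch on a toy Series (the values are arbitrary and chosen only for illustration):

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c", "a"])

# "a" occurs at positions 0, 2 and 4; "b" and "c" occur once.
first = s.duplicated(keep="first")  # flags positions 2 and 4
last = s.duplicated(keep="last")    # flags positions 0 and 2
every = s.duplicated(keep=False)    # flags positions 0, 2 and 4

print(pd.DataFrame({"value": s, "first": first, "last": last, "False": every}))
```

Only keep=False flags all three "a" rows, which is why it is the setting to use when you want the full list of duplicated items.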