How do I get a list of all the duplicate items using pandas in python?
Method #1: print all rows where the ID is one of the IDs in duplicated:
>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> # DataFrame.sort was removed in later pandas releases; sort_values is the replacement
>>> df[ids.isin(ids[ids.duplicated()])].sort_values("ID")
       ID  ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR  FIRST_VISIT_DATE
24  11795        27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH         19-Jun-12
 6  11795         3-Jul-12  0649597-White River VT  0649597-White River VT         30-Mar-12
18   8096        19-Dec-11  0649597-White River VT  0649597-White River VT          9-Apr-12
 2   8096         8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH         25-Jun-12
12   A036        30-Nov-11     063B208-Randolph VT     063B208-Randolph VT               NaN
 3   A036         1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH          9-Aug-12
26   A036        11-Aug-12      06D3206-Hanover NH                     NaN         19-Jun-12
This works, but I couldn't think of a nice way to avoid repeating ids so many times. I prefer method #2: groupby on the ID.
>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
       ID  ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR  FIRST_VISIT_DATE
 6  11795         3-Jul-12  0649597-White River VT  0649597-White River VT         30-Mar-12
24  11795        27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH         19-Jun-12
 2   8096         8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH         25-Jun-12
18   8096        19-Dec-11  0649597-White River VT  0649597-White River VT          9-Apr-12
 3   A036         1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH          9-Aug-12
12   A036        30-Nov-11     063B208-Randolph VT     063B208-Randolph VT               NaN
26   A036        11-Aug-12      06D3206-Hanover NH                     NaN         19-Jun-12
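A variation on the groupby idea, as a rough sketch: groupby(...).filter(...) keeps only the groups with more than one row, so no explicit concat is needed. The small frame below is made-up sample data standing in for dup.csv, and the names dups and dups_sorted are just illustrative.

```python
import pandas as pd

# Hypothetical sample data; the real dup.csv is not reproduced here
df = pd.DataFrame({
    "ID": ["11795", "8096", "A036", "11795", "B111", "A036", "A036"],
    "ENROLLMENT_DATE": ["27-Feb-12", "19-Dec-11", "30-Nov-11",
                        "3-Jul-12", "1-Jan-12", "1-Apr-12", "11-Aug-12"],
})

# Keep only the rows belonging to groups that contain more than one row
dups = df.groupby("ID").filter(lambda g: len(g) > 1)

# filter() preserves the original row order, so sort to put equal IDs together
# (mergesort is stable, so rows within an ID keep their original order)
dups_sorted = dups.sort_values("ID", kind="mergesort")
print(dups_sorted)
```

Unlike the pd.concat version, this leaves the rows in their original order until you sort them, which may or may not be what you want.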
As of pandas 0.17, you can pass keep=False to the duplicated function to get all the duplicate items.
In [1]: import pandas as pd

In [2]: df = pd.DataFrame(['a','b','c','d','a','b'])

In [3]: df
Out[3]:
   0
0  a
1  b
2  c
3  d
4  a
5  b

In [4]: df[df.duplicated(keep=False)]
Out[4]:
   0
0  a
1  b
4  a
5  b
df[df.duplicated(['ID'], keep=False)]
This returns every row whose ID appears more than once.
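To make that concrete, here is a small sketch using invented data (the TRAINER values and the names mask and all_dups are only for illustration). It restricts the duplicate check to the ID column and then sorts so matching IDs end up next to each other.

```python
import pandas as pd

# Hypothetical sample data with two repeated IDs and one unique ID
df = pd.DataFrame({
    "ID": ["A036", "8096", "C999", "A036", "8096"],
    "TRAINER": ["Randolph VT", "White River VT", "Hanover NH",
                "Hanover NH", "Hanover NH"],
})

# True for every row whose ID occurs more than once (first occurrence included)
mask = df.duplicated(subset=["ID"], keep=False)

# Select those rows and group equal IDs together; mergesort keeps ties in order
all_dups = df[mask].sort_values("ID", kind="mergesort")
print(all_dups)
```

Only the ID column is compared here, so rows count as duplicates even when their other columns differ.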
According to the documentation:
keep : {‘first’, ‘last’, False}, default ‘first’
- first : Mark duplicates as True except for the first occurrence.
- last : Mark duplicates as True except for the last occurrence.
- False : Mark all duplicates as True.
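A quick way to see the difference between the three options is to run them side by side. This sketch reuses the values 'a', 'b', 'c', 'd', 'a', 'b' from the example above as a Series; the variable names are arbitrary.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d", "a", "b"])

# keep="first": the first occurrence is not marked, later repeats are
first = s.duplicated(keep="first")

# keep="last": the last occurrence is not marked, earlier repeats are
last = s.duplicated(keep="last")

# keep=False: every member of a duplicated set is marked
every = s.duplicated(keep=False)

# Show the three masks next to the values they describe
print(pd.DataFrame({"value": s, "first": first, "last": last, "False": every}))
```

Only keep=False marks both the original and its repeats, which is why it is the option to use when you want to list every duplicated item.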