Splitting duplicates into separate table - Pandas
You can call the duplicated method on the foo column and then subset your original data frame based on it, something like this:
data.loc[data['foo'].duplicated(), :]
As an example:
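import pandas as pd

data = pd.DataFrame({'foo': [1,1,1,2,2,2], 'bar': [1,1,2,2,3,3]})
data
# (column order in the printed frame may vary with the pandas version)
#   bar  foo
#0    1    1
#1    1    1
#2    2    1
#3    2    2
#4    3    2
#5    3    2
data.loc[data['foo'].duplicated(), :]
#   bar  foo
#1    1    1
#2    2    1
#4    3    2
#5    3    2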
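Since the question is about splitting the duplicates off into a separate table, here is a minimal sketch of how the same mask could give you both halves. The names mask, duplicates and remainder are just illustrative, not anything pandas defines:

```python
import pandas as pd

data = pd.DataFrame({'foo': [1, 1, 1, 2, 2, 2], 'bar': [1, 1, 2, 2, 3, 3]})

# True for every row whose 'foo' value has already appeared further up
mask = data['foo'].duplicated()

# Rows flagged as duplicates go into one table ...
duplicates = data.loc[mask]
# ... and the complement of the mask gives the first occurrence of each 'foo'
remainder = data.loc[~mask]

print(duplicates)
print(remainder)
```

Every row ends up in exactly one of the two tables, so nothing is lost by the split.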
drop_duplicates has a companion method duplicated. They both take similar arguments.
The key arguments are:
subset - column label or sequence of labels. Only consider certain columns for identifying duplicates; by default all of the columns are used.
keep - {'first', 'last', False}, default 'first'
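As a rough illustration of how subset changes what counts as a duplicate, here is a small sketch on a two-column frame. The frame and the variable names are made up for the example:

```python
import pandas as pd

data = pd.DataFrame({'foo': [1, 1, 1, 2, 2, 2], 'bar': [1, 1, 2, 2, 3, 3]})

# Default: a row is a duplicate only if ALL of its columns repeat an earlier row
all_columns = data.duplicated()

# subset: only the 'foo' column is compared
foo_only = data.duplicated(subset=['foo'])

# subset and keep can be combined; here the last occurrence is the one not flagged
foo_only_last = data.duplicated(subset=['foo'], keep='last')

print(all_columns.tolist())    # [False, True, False, False, False, True]
print(foo_only.tolist())       # [False, True, True, False, True, True]
print(foo_only_last.tolist())  # [True, True, False, True, True, False]
```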
When keep is set to 'first':
drop_duplicates returns a dataframe in which the first occurrence of each combination of the columns specified by subset is kept and the rest are dropped.
duplicated returns a boolean mask with the same index as the original dataframe, with a value of True for every duplicated combination of the specified columns except the first occurrence. You can use this mask to get either the rows to be dropped or its complement (the same result as drop_duplicates).
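To make the relationship between the two methods concrete, here is a short sketch that checks the complement of the mask against drop_duplicates. The variable names are only for illustration:

```python
import pandas as pd

df = pd.DataFrame(list('abcdbef'), columns=['letter'])

mask = df.duplicated(keep='first')

to_drop = df[mask]   # the rows drop_duplicates would remove
kept = df[~mask]     # the complement: the rows drop_duplicates would keep

# Compare the complement with what drop_duplicates returns
same = kept.equals(df.drop_duplicates(keep='first'))
print(same)
```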
Example
df = pd.DataFrame(list('abcdbef'), columns=['letter'])
df
df.drop_duplicates(keep='first') # same as default
df.duplicated(keep='first') # same as default
0    False
1    False
2    False
3    False
4     True
5    False
6    False
dtype: bool
Notice the row corresponding to the first instance of 'b' is False while the second instance is True, indicating it is to be dropped.
Answer
df[df.duplicated(keep='first')]
keep='last' and keep=False
Here are examples of what it looks like with the keep argument set to 'last' or False.
drop duplicates
df.drop_duplicates(keep='last')
df.duplicated(keep='last')
0    False
1     True
2    False
3    False
4    False
5    False
6    False
dtype: bool
This time the first instance is True, indicating it is to be dropped, while the second instance is False, indicating it is not to be dropped.
just the duplicates
df[df.duplicated(keep='last')]
drop duplicates
df.drop_duplicates(keep=False)
df.duplicated(keep=False)
0    False
1     True
2    False
3    False
4     True
5    False
6    False
dtype: bool
This time both instances are True and both are dropped.
just the duplicates
df[df.duplicated(keep=False)]
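If it helps to see the three keep settings next to each other, one possible way is to collect the masks into a single frame. The summary frame and its column names are just an assumption for this sketch:

```python
import pandas as pd

df = pd.DataFrame(list('abcdbef'), columns=['letter'])

# One column per keep setting; True marks a row that setting treats as a duplicate
summary = pd.DataFrame({
    'letter': df['letter'],
    'keep_first': df.duplicated(keep='first'),
    'keep_last': df.duplicated(keep='last'),
    'keep_false': df.duplicated(keep=False),
})

print(summary)
```

Only the two 'b' rows (index 1 and 4) are ever flagged: keep='first' flags index 4, keep='last' flags index 1, and keep=False flags both.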