Splitting duplicates into separate table - Pandas

python pandas dataframe duplicates

You can call the duplicated method on the foo column and use the resulting boolean mask to subset your original DataFrame, something like this:

data.loc[data['foo'].duplicated(), :]

As an example:

import pandas as pd

data = pd.DataFrame({'foo': [1,1,1,2,2,2], 'bar': [1,1,2,2,3,3]})
data
#    foo  bar
# 0    1    1
# 1    1    1
# 2    1    2
# 3    2    2
# 4    2    3
# 5    2    3

data.loc[data['foo'].duplicated(), :]
#    foo  bar
# 1    1    1
# 2    1    2
# 4    2    3
# 5    2    3
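Since the question asks for the duplicates in a separate table, the same mask can be used twice: once as-is for the duplicates and once negated for everything else. A minimal sketch on the same example data; the names `mask`, `duplicates`, and `remainder` are illustrative and not part of the original answer.

```python
import pandas as pd

data = pd.DataFrame({'foo': [1, 1, 1, 2, 2, 2], 'bar': [1, 1, 2, 2, 3, 3]})

# True for every row whose 'foo' value has already appeared in an earlier row
mask = data['foo'].duplicated()

duplicates = data.loc[mask, :]    # rows 1, 2, 4, 5
remainder = data.loc[~mask, :]    # rows 0, 3 (first occurrence of each 'foo')
```

Every row lands in exactly one of the two tables, so the split loses nothing.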


drop_duplicates has a companion method duplicated. They both take similar arguments.

The key arguments are:

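subset - column label or sequence of labels
- Only consider these columns when identifying duplicates; by default all columns are used
keep - {‘first’, ‘last’, False}, default ‘first’
- Which occurrence, if any, to treat as the original: ‘first’ flags every occurrence except the first, ‘last’ flags every occurrence except the last, and False flags all of them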
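To illustrate subset, here is a small sketch on a hypothetical two-column frame (the data and the names `people`, `all_cols`, and `by_name` are made up for illustration). Without subset, a row counts as a duplicate only if every column matches an earlier row; restricting subset to one column loosens that.

```python
import pandas as pd

# Hypothetical data, not from the original question
people = pd.DataFrame({
    'name': ['ann', 'ann', 'bob', 'bob'],
    'city': ['oslo', 'oslo', 'rome', 'lima'],
})

# All columns considered: only row 1 repeats an earlier row exactly
all_cols = people.duplicated()

# Only 'name' considered: rows 1 and 3 repeat an earlier name
by_name = people.duplicated(subset='name')
```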

When keep is set to 'first':

drop_duplicates returns a DataFrame that keeps the first occurrence of each combination of values in the columns specified by subset and drops the rest.
duplicated returns a boolean mask with the same index as the original DataFrame, with a value of True for every duplicated combination of the specified columns except the first occurrence. You can use this mask to select either the rows that would be dropped or its complement (the same result as drop_duplicates).
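The relationship between the two methods can be checked directly. A minimal sketch on a small single-column frame; the names `mask`, `dropped`, `kept`, and `same` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(list('abcdbef'), columns=['letter'])

mask = df.duplicated(keep='first')

dropped = df[mask]     # the rows drop_duplicates would discard
kept = df[~mask]       # the complement of the mask

# The complement matches drop_duplicates called with the same arguments
same = kept.equals(df.drop_duplicates(keep='first'))
```

Here `same` is True, and `dropped` holds only the second 'b'.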

Example

df = pd.DataFrame(list('abcdbef'), columns=['letter'])
df

df.drop_duplicates(keep='first')  # same as default

df.duplicated(keep='first')  # same as default
0    False
1    False
2    False
3    False
4     True
5    False
6    False
dtype: bool

Notice that the row corresponding to the first instance of 'b' is False, while the second instance is True, indicating it is to be dropped.

Answer

df[df.duplicated(keep='first')]

`keep='last'` and `keep=False`

Here are examples of what it looks like with the keep argument set to 'last' or False.

drop duplicates

df.drop_duplicates(keep='last')

df.duplicated(keep='last')
0    False
1     True
2    False
3    False
4    False
5    False
6    False
dtype: bool

This time the first instance is True, indicating it is to be dropped, while the second instance is False, indicating it is kept.

just the duplicates

df[df.duplicated(keep='last')]

drop duplicates

df.drop_duplicates(keep=False)

df.duplicated(keep=False)
0    False
1     True
2    False
3    False
4     True
5    False
6    False
dtype: bool

This time both instances are True and both are dropped.

just the duplicates

df[df.duplicated(keep=False)]
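To compare the three settings at a glance, the masks can be placed side by side. This is only a sketch for inspection; the column labels and the `comparison` and `flagged` names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(list('abcdbef'), columns=['letter'])

# One column per keep setting; each cell says whether that row is flagged
comparison = pd.DataFrame({
    'letter': df['letter'],
    'first': df.duplicated(keep='first'),
    'last': df.duplicated(keep='last'),
    'False': df.duplicated(keep=False),
})

# Row labels flagged under each setting
flagged = {col: comparison.loc[comparison[col]].index.tolist()
           for col in ['first', 'last', 'False']}
```

For splitting every member of a duplicated group into its own table, `keep=False` is the setting to reach for, since it flags both rows 1 and 4 rather than only one of them.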
