Remove pandas rows with duplicate indices
I would suggest using the duplicated method on the Pandas Index itself:
df3 = df3[~df3.index.duplicated(keep='first')]
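To see the idiom end to end, here is a minimal runnable sketch; the frame below is a stand-in I made up for the answer's df3 (a frame with repeated timestamp labels):

```python
import pandas as pd

# stand-in for df3: a frame whose index contains a duplicated label
df3 = pd.DataFrame(
    {'A': [0, 1, 2, 3], 'B': [10, 11, 12, 13]},
    index=pd.to_datetime(['2001-01-01 00:00', '2001-01-01 00:00',
                          '2001-01-01 01:00', '2001-01-01 02:00']))

# index.duplicated() returns a boolean array marking repeated labels;
# negating it keeps the first occurrence of each label
deduped = df3[~df3.index.duplicated(keep='first')]
print(deduped)
```

Because the mask is computed on the Index alone, no columns are touched and no index round-trip is needed.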
While all the other methods work, .drop_duplicates is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.
Using the sample data provided:
>>> %timeit df3.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
1000 loops, best of 3: 1.54 ms per loop

>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 µs per loop

>>> %timeit df3[~df3.index.duplicated(keep='first')]
1000 loops, best of 3: 307 µs per loop
Note that you can keep the last element by changing the keep argument to 'last'.
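A quick sketch of keep='last' on a toy frame of my own (not the answer's data): the final row for each duplicated label is the one that survives.

```python
import pandas as pd

# toy frame with a duplicated label 'x'
df3 = pd.DataFrame({'rownum': [0, 1, 2]}, index=['x', 'x', 'y'])

# keep='last' marks all but the final occurrence as duplicates
last_kept = df3[~df3.index.duplicated(keep='last')]
print(last_kept)
```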
This method also works with a MultiIndex (using df1 as specified in Paul's example):
>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 µs per loop

>>> %timeit df1[~df1.index.duplicated(keep='last')]
1000 loops, best of 3: 365 µs per loop
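For the MultiIndex case, here is a self-contained sketch (the tuples and values are mine, not the df1 from Paul's example): duplicated() compares full label tuples, so both levels must match for a row to count as a duplicate.

```python
import pandas as pd

# MultiIndex with the tuple ('a', 1) appearing twice
idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 1), ('b', 2)],
                                names=['iA', 'iB'])
df1 = pd.DataFrame({'val': [10, 11, 12]}, index=idx)

# keep='last' retains the final row for each duplicated tuple
deduped = df1[~df1.index.duplicated(keep='last')]
print(deduped)
```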
This adds the index as a DataFrame column, drops duplicates on that column, then restores it as the index:
df = (df.reset_index()
        .drop_duplicates(subset='index', keep='last')
        .set_index('index')
        .sort_index())
Note that the final .sort_index() is optional; include it only if you need the result sorted.
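A runnable version of that round-trip, on a toy frame I made up; note that reset_index names the column 'index' only when the original index is unnamed, which is what the subset='index' argument relies on:

```python
import pandas as pd

# toy frame with the unnamed, duplicated index 'k1'
df = pd.DataFrame({'A': [1, 2, 3]}, index=['k1', 'k1', 'k2'])

# move the index into a column, dedupe on it, then restore it as the index
df = (df.reset_index()
        .drop_duplicates(subset='index', keep='last')
        .set_index('index')
        .sort_index())
print(df)
```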
Oh my. This is actually so simple!
grouped = df3.groupby(level=0)
df4 = grouped.last()
df4

                     A  B  rownum
2001-01-01 00:00:00  0  0       6
2001-01-01 01:00:00  1  1       7
2001-01-01 02:00:00  2  2       8
2001-01-01 03:00:00  3  3       3
2001-01-01 04:00:00  4  4       4
2001-01-01 05:00:00  5  5       5
Follow-up edit 2013-10-29: In the case where I have a fairly complex MultiIndex, I think I prefer the groupby approach. Here's a simple example for posterity:
import numpy as np
import pandas

# fake index
idx = pandas.MultiIndex.from_tuples([('a', letter) for letter in list('abcde')])

# random data + naming the index levels
df1 = pandas.DataFrame(np.random.normal(size=(5, 2)),
                       index=idx, columns=['colA', 'colB'])
df1.index.names = ['iA', 'iB']

# artificially append some duplicate data
dups = df1[df1.index.get_level_values('iB').isin(['c', 'e'])]
df1 = pandas.concat([df1, dups])
df1
#           colA      colB
#iA iB
#a  a  -1.297535  0.691787
#   b  -1.688411  0.404430
#   c   0.275806 -0.078871
#   d  -0.509815 -0.220326
#   e  -0.066680  0.607233
#   c   0.275806 -0.078871  # <--- dup 1
#   e  -0.066680  0.607233  # <--- dup 2
And here's the important part:
# group the data; using df1.index.names tells pandas to look at the entire index
groups = df1.groupby(level=df1.index.names)
groups.last()  # or .first()
#           colA      colB
#iA iB
#a  a  -1.297535  0.691787
#   b  -1.688411  0.404430
#   c   0.275806 -0.078871
#   d  -0.509815 -0.220326
#   e  -0.066680  0.607233
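As a sanity check (my addition, not part of the original answer): on a MultiIndex with duplicated tuples, groupby over the index names and the duplicated-mask approach produce the same result once sorted, since groupby sorts its output by default.

```python
import pandas as pd

# small MultiIndex where the tuple ('a', 'c') repeats
idx = pd.MultiIndex.from_tuples(
    [('a', 'c'), ('a', 'd'), ('a', 'c')], names=['iA', 'iB'])
df1 = pd.DataFrame({'colA': [1.0, 2.0, 3.0]}, index=idx)

# groupby keeps the last row per tuple, as does duplicated(keep='last')
via_groupby = df1.groupby(level=df1.index.names).last()
via_duplicated = df1[~df1.index.duplicated(keep='last')].sort_index()
print(via_groupby.equals(via_duplicated))
```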