Remove pandas rows with duplicate indices
I would suggest using the duplicated method on the Pandas Index itself:
df3 = df3[~df3.index.duplicated(keep='first')]
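To see the idiom end to end, here is a minimal runnable sketch; the frame below is a stand-in I made up for the answer's df3 (a frame with repeated timestamp labels):

```python
import pandas as pd

# stand-in for df3: a frame whose index contains a duplicated label
df3 = pd.DataFrame(
    {'A': [0, 1, 2, 3], 'B': [10, 11, 12, 13]},
    index=pd.to_datetime(['2001-01-01 00:00', '2001-01-01 00:00',
                          '2001-01-01 01:00', '2001-01-01 02:00']))

# index.duplicated() returns a boolean array marking repeated labels;
# negating it keeps the first occurrence of each label
deduped = df3[~df3.index.duplicated(keep='first')]
print(deduped)
```

Because the mask is computed on the Index alone, no columns are touched and no index round-trip is needed.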
While all the other methods work, .drop_duplicates is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.
Using the sample data provided:
>>> %timeit df3.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
1000 loops, best of 3: 1.54 ms per loop

>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 µs per loop

>>> %timeit df3[~df3.index.duplicated(keep='first')]
1000 loops, best of 3: 307 µs per loop
Note that you can keep the last element by changing the keep argument to 'last'.
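A quick sketch of keep='last' on a toy frame of my own (not the answer's data): the final row for each duplicated label is the one that survives.

```python
import pandas as pd

# toy frame with a duplicated label 'x'
df3 = pd.DataFrame({'rownum': [0, 1, 2]}, index=['x', 'x', 'y'])

# keep='last' marks all but the final occurrence as duplicates
last_kept = df3[~df3.index.duplicated(keep='last')]
print(last_kept)
```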
This method also works with a MultiIndex (using df1 as specified in Paul's example):
>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 µs per loop

>>> %timeit df1[~df1.index.duplicated(keep='last')]
1000 loops, best of 3: 365 µs per loop
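For the MultiIndex case, here is a self-contained sketch (the tuples and values are mine, not the df1 from Paul's example): duplicated() compares full label tuples, so both levels must match for a row to count as a duplicate.

```python
import pandas as pd

# MultiIndex with the tuple ('a', 1) appearing twice
idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 1), ('b', 2)],
                                names=['iA', 'iB'])
df1 = pd.DataFrame({'val': [10, 11, 12]}, index=idx)

# keep='last' retains the final row for each duplicated tuple
deduped = df1[~df1.index.duplicated(keep='last')]
print(deduped)
```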
This adds the index as a DataFrame column, drops duplicates on that column, then restores it as the index:
df = (df.reset_index()
        .drop_duplicates(subset='index', keep='last')
        .set_index('index')
        .sort_index())
Note that the final .sort_index() is optional; include it only if you need the result sorted.
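A runnable version of that round-trip, on a toy frame I made up; note that reset_index names the column 'index' only when the original index is unnamed, which is what the subset='index' argument relies on:

```python
import pandas as pd

# toy frame with the unnamed, duplicated index 'k1'
df = pd.DataFrame({'A': [1, 2, 3]}, index=['k1', 'k1', 'k2'])

# move the index into a column, dedupe on it, then restore it as the index
df = (df.reset_index()
        .drop_duplicates(subset='index', keep='last')
        .set_index('index')
        .sort_index())
print(df)
```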
Oh my. This is actually so simple!
grouped = df3.groupby(level=0)
df4 = grouped.last()
df4

                     A  B  rownum
2001-01-01 00:00:00  0  0       6
2001-01-01 01:00:00  1  1       7
2001-01-01 02:00:00  2  2       8
2001-01-01 03:00:00  3  3       3
2001-01-01 04:00:00  4  4       4
2001-01-01 05:00:00  5  5       5
Follow-up edit 2013-10-29: In the case where I have a fairly complex MultiIndex, I think I prefer the groupby approach. Here's a simple example for posterity:
import numpy as np
import pandas

# fake index
idx = pandas.MultiIndex.from_tuples([('a', letter) for letter in list('abcde')])

# random data + naming the index levels
df1 = pandas.DataFrame(np.random.normal(size=(5, 2)),
                       index=idx, columns=['colA', 'colB'])
df1.index.names = ['iA', 'iB']

# artificially append some duplicate data
dups = df1[df1.index.get_level_values('iB').isin(['c', 'e'])]
df1 = pandas.concat([df1, dups])
df1
#           colA      colB
#iA iB
#a  a  -1.297535  0.691787
#   b  -1.688411  0.404430
#   c   0.275806 -0.078871
#   d  -0.509815 -0.220326
#   e  -0.066680  0.607233
#   c   0.275806 -0.078871  # <--- dup 1
#   e  -0.066680  0.607233  # <--- dup 2
And here's the important part:
# group the data; using df1.index.names tells pandas to look at the entire index
groups = df1.groupby(level=df1.index.names)
groups.last()  # or .first()
#           colA      colB
#iA iB
#a  a  -1.297535  0.691787
#   b  -1.688411  0.404430
#   c   0.275806 -0.078871
#   d  -0.509815 -0.220326
#   e  -0.066680  0.607233
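As a sanity check (my addition, not part of the original answer): on a MultiIndex with duplicated tuples, groupby over the index names and the duplicated-mask approach produce the same result once sorted, since groupby sorts its output by default.

```python
import pandas as pd

# small MultiIndex where the tuple ('a', 'c') repeats
idx = pd.MultiIndex.from_tuples(
    [('a', 'c'), ('a', 'd'), ('a', 'c')], names=['iA', 'iB'])
df1 = pd.DataFrame({'colA': [1.0, 2.0, 3.0]}, index=idx)

# groupby keeps the last row per tuple, as does duplicated(keep='last')
via_groupby = df1.groupby(level=df1.index.names).last()
via_duplicated = df1[~df1.index.duplicated(keep='last')].sort_index()
print(via_groupby.equals(via_duplicated))
```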