Pandas/Python: How to concatenate two dataframes without duplicates?
The simplest way is to just do the concatenation, and then drop duplicates.
    >>> df1
       A  B
    0  1  2
    1  3  1
    >>> df2
       A  B
    0  5  6
    1  3  1
    >>> pandas.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
       A  B
    0  1  2
    1  3  1
    2  5  6
The reset_index(drop=True) call fixes up the index after the concat() and drop_duplicates(). Without it you will have an index of [0,1,0] instead of [0,1,2]. This could cause problems for further operations on this dataframe down the road if it isn't reset right away.
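To make the index problem concrete, here is a minimal sketch that rebuilds df1 and df2 from the values shown above and looks at what happens when the reset is skipped. The variable names raw, clean and also_clean are only illustrative, and the ignore_index argument of drop_duplicates() assumes pandas 1.0 or newer.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 3], "B": [2, 1]})
df2 = pd.DataFrame({"A": [5, 3], "B": [6, 1]})

# Without the reset, the labels from both inputs survive.
raw = pd.concat([df1, df2]).drop_duplicates()
print(raw.index.tolist())   # [0, 1, 0]
print(raw.loc[0])           # two rows, because label 0 appears twice

# Resetting gives every row a unique label again.
clean = raw.reset_index(drop=True)
print(clean.index.tolist())  # [0, 1, 2]

# Assumes pandas >= 1.0: drop_duplicates can renumber the result itself.
also_clean = pd.concat([df1, df2]).drop_duplicates(ignore_index=True)
```

A label-based lookup such as raw.loc[0] silently returning two rows is the kind of downstream surprise the reset avoids.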
If DataFrame A already contains a duplicate row, concatenating and then dropping duplicate rows will also collapse it, removing rows from DataFrame A that you might want to keep.
In that case, add a column with a cumulative count of each repeated row to both dataframes before concatenating, then drop duplicates and remove the helper column. Whether you need this depends on your use case, but it is common with time-series data.
Here is an example:
    import pandas as pd

    df_1 = pd.DataFrame([
        {'date':'11/20/2015', 'id':4, 'value':24},
        {'date':'11/20/2015', 'id':4, 'value':24},
        {'date':'11/20/2015', 'id':6, 'value':34},
    ])
    df_2 = pd.DataFrame([
        {'date':'11/20/2015', 'id':4, 'value':24},
        {'date':'11/20/2015', 'id':6, 'value':14},
    ])

    # Number each repeat of an identical row within its own dataframe (0, 1, ...)
    df_1['count'] = df_1.groupby(['date','id','value']).cumcount()
    df_2['count'] = df_2.groupby(['date','id','value']).cumcount()

    df_tot = pd.concat([df_1,df_2], ignore_index=False)
    df_tot = df_tot.drop_duplicates()
    df_tot = df_tot.drop(['count'], axis=1)

    >>> df_tot
             date  id  value
    0  11/20/2015   4     24
    1  11/20/2015   4     24
    2  11/20/2015   6     34
    1  11/20/2015   6     14
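The cumulative-count steps above can be wrapped in a small reusable function. This is only a sketch: the function name concat_keeping_repeats and the helper column name "_occurrence" are made up for illustration, and it assumes both dataframes have the same columns, contain no missing values in them, and do not already use the helper column name.

```python
import pandas as pd


def concat_keeping_repeats(first, second, helper="_occurrence"):
    """Concatenate two frames with the same columns, dropping a row of
    `second` only when it matches a not-yet-matched row of `first`."""
    cols = list(first.columns)
    numbered = []
    for frame in (first, second):
        frame = frame.copy()  # leave the caller's dataframes untouched
        # Number each repeat of an identical row: 0, 1, 2, ...
        frame[helper] = frame.groupby(cols).cumcount().to_numpy()
        numbered.append(frame)
    combined = pd.concat(numbered, ignore_index=True)
    # Rows now count as duplicates only if their occurrence number matches too.
    return combined.drop_duplicates().drop(columns=helper).reset_index(drop=True)


df_1 = pd.DataFrame({"date": ["11/20/2015"] * 3, "id": [4, 4, 6], "value": [24, 24, 34]})
df_2 = pd.DataFrame({"date": ["11/20/2015"] * 2, "id": [4, 6], "value": [24, 14]})
result = concat_keeping_repeats(df_1, df_2)
print(result)
```

With this scheme each distinct row ends up appearing as many times as in whichever input contains more copies of it, so the two identical rows in df_1 both survive while the matching row from df_2 is dropped.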
I'm surprised that pandas doesn't offer a native solution for this task. I don't think that it's efficient to just drop the duplicates if you work with large datasets (as Rian G suggested).
It is probably most efficient to use sets to find the non-overlapping indices, then use a list comprehension to translate from index to 'row location' (a boolean mask), which you need to select rows with iloc. Below you find a function that performs the task. If you don't choose a specific column (col) to check for duplicates, the indexes will be used, as you requested. If you choose a specific column, be aware that existing duplicate entries in 'a' will remain in the result.
    import pandas as pd

    def append_non_duplicates(a, b, col=None):
        # Each input must be a DataFrame or None.
        if ((a is not None and not isinstance(a, pd.DataFrame)) or
                (b is not None and not isinstance(b, pd.DataFrame))):
            raise ValueError('a and b must be of type pandas.core.frame.DataFrame.')
        if a is None:
            return b
        if b is None:
            return a
        if col is not None:
            # Compare the values in the column at position col.
            aind = a.iloc[:, col].values
            bind = b.iloc[:, col].values
        else:
            # Compare the index labels.
            aind = a.index.values
            bind = b.index.values
        # Keys that occur in b but not in a (kept as a set for fast lookups).
        take_rows = set(bind) - set(aind)
        # Boolean mask over the rows of b.
        take_rows = [i in take_rows for i in bind]
        # DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
        return pd.concat([a, b.iloc[take_rows, :]])

    # Usage
    a = pd.DataFrame([[1,2,3],[1,5,6],[1,12,13]], index=[1000,2000,5000])
    b = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], index=[1000,2000,3000])

    append_non_duplicates(a,b)
    #        0   1   2
    # 1000   1   2   3    <- from a
    # 2000   1   5   6    <- from a
    # 5000   1  12  13    <- from a
    # 3000   7   8   9    <- from b

    append_non_duplicates(a,b,0)
    #        0   1   2
    # 1000   1   2   3    <- from a
    # 2000   1   5   6    <- from a
    # 5000   1  12  13    <- from a
    # 2000   4   5   6    <- from b
    # 3000   7   8   9    <- from b
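For comparison, the same selection can probably be written without explicit sets by using pandas' own isin() on the index or on a column. This is a different technique from the function above, shown only as a sketch; the names by_index and by_column are illustrative, and it assumes b has a unique index so the boolean mask lines up with its rows.

```python
import pandas as pd

a = pd.DataFrame([[1, 2, 3], [1, 5, 6], [1, 12, 13]], index=[1000, 2000, 5000])
b = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=[1000, 2000, 3000])

# Keep only the rows of b whose index label is absent from a.
by_index = pd.concat([a, b[~b.index.isin(a.index)]])

# Same idea, but comparing the values in the first column (position 0).
by_column = pd.concat([a, b[~b.iloc[:, 0].isin(a.iloc[:, 0])]])

print(by_index)
print(by_column)
```

On the example data this selects the same rows as append_non_duplicates(a,b) and append_non_duplicates(a,b,0) respectively. As with the function, duplicates that already exist in a are left in place.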