Pandas join on columns with different names [duplicate]


When the key columns have different names, use the left_on and right_on parameters instead of on:

pd.merge(df1, df2, left_on=['userid', 'column1'], right_on=['username', 'column1'], how='left')
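As a minimal sketch of the call above: the column names userid, username and column1 come from the snippet, while the score and group columns and all the values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: only the key column names are taken from the answer.
df1 = pd.DataFrame({'userid': ['ann', 'bob', 'cy'],
                    'column1': [1, 2, 3],
                    'score': [10, 20, 30]})
df2 = pd.DataFrame({'username': ['ann', 'bob'],
                    'column1': [1, 9],
                    'group': ['a', 'b']})

# left_on and right_on pair up key columns by position:
# userid <-> username and column1 <-> column1.
merged = pd.merge(df1, df2,
                  left_on=['userid', 'column1'],
                  right_on=['username', 'column1'],
                  how='left')

# Only ('ann', 1) exists in both frames, so the other rows get NaN
# in the columns that come from df2.
print(merged)
```

Because how='left' is used, every row of df1 is kept whether or not it finds a partner in df2.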


An alternative approach is to use join, setting the index of the right-hand DataFrame to the key columns ['username', 'column1']:

df1.join(df2.set_index(['username', 'column1']), on=['userid', 'column1'], how='left')

The output of this join merges the matched keys from the two differently named key columns, userid and username, into a single column named after the key column of df1, userid; whereas the output of the merge maintains the two as separate columns. To illustrate, consider the following example:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': [1,2,3,4,5,6], 'Values': [435,33,45,np.nan,np.nan,12], 'pID': [21,22,23,24,25,26]})
##    ID  Values  pID
## 0   1   435.0   21
## 1   2    33.0   22
## 2   3    45.0   23
## 3   4     NaN   24
## 4   5     NaN   25
## 5   6    12.0   26

df2 = pd.DataFrame({'ID': [4,4,5], 'Values': [544, 545, 676], 'pid': [24,25,25]})
##    ID  Values  pid
## 0   4     544   24
## 1   4     545   25
## 2   5     676   25

pd.merge(df1, df2, how='left', left_on=['ID', 'pID'], right_on=['ID', 'pid'])
##    ID  Values_x  pID  Values_y   pid
## 0   1     435.0   21       NaN   NaN
## 1   2      33.0   22       NaN   NaN
## 2   3      45.0   23       NaN   NaN
## 3   4       NaN   24     544.0  24.0
## 4   5       NaN   25     676.0  25.0
## 5   6      12.0   26       NaN   NaN

df1.join(df2.set_index(['ID','pid']), how='left', on=['ID','pID'], lsuffix='_x', rsuffix='_y')
##    ID  Values_x  pID  Values_y
## 0   1     435.0   21       NaN
## 1   2      33.0   22       NaN
## 2   3      45.0   23       NaN
## 3   4       NaN   24     544.0
## 4   5       NaN   25     676.0
## 5   6      12.0   26       NaN

Here, we also need to specify lsuffix and rsuffix in join to distinguish the overlapping column Values in the output. As one can see, the output of merge contains the extra pid column from the right-hand DataFrame, which IMHO is unnecessary given the context of the merge. Note also that the dtype of the pid column has changed to float64, an upcast caused by the NaNs introduced for the unmatched rows.
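The two observations above can be checked directly. The following is a sketch that rebuilds the same example frames; the name trimmed is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                    'Values': [435, 33, 45, np.nan, np.nan, 12],
                    'pID': [21, 22, 23, 24, 25, 26]})
df2 = pd.DataFrame({'ID': [4, 4, 5],
                    'Values': [544, 545, 676],
                    'pid': [24, 25, 25]})

merged = pd.merge(df1, df2, how='left',
                  left_on=['ID', 'pID'], right_on=['ID', 'pid'])
joined = df1.join(df2.set_index(['ID', 'pid']), how='left',
                  on=['ID', 'pID'], lsuffix='_x', rsuffix='_y')

print(df2['pid'].dtype)          # an integer dtype in the input
print(merged['pid'].dtype)       # float64: unmatched rows hold NaN
print('pid' in joined.columns)   # False: join consumed pid as index

# Dropping the redundant key column from the merge result leaves
# the same set of columns that join produces.
trimmed = merged.drop(columns='pid')
print(list(trimmed.columns) == list(joined.columns))
```

If the extra column is the only objection to merge, dropping it afterwards is an option that avoids the set_index call.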

This tidier output comes at a cost in performance, as the call to set_index on the right-hand DataFrame incurs some overhead. However, a quick and dirty profile shows that this is not too horrible: join is roughly 30% slower than merge here, which may be worth it:

sz = 1000000 # one million rows
df1 = pd.DataFrame({'ID': np.arange(sz), 'pID' : np.arange(0,2*sz,2), 'Values' : np.random.random(sz)})
df2 = pd.DataFrame({'ID': np.concatenate([np.arange(sz/2),np.arange(sz/2)]), 'pid' : np.arange(0,2*sz,2), 'Values' : np.random.random(sz)})

%timeit pd.merge(df1, df2, how='left', left_on=['ID', 'pID'], right_on=['ID', 'pid'])
## 818 ms ± 33.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df1.join(df2.set_index(['ID','pid']), how='left', on=['ID','pID'], lsuffix='_x', rsuffix='_y')
## 1.04 s ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
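%timeit only works inside IPython or Jupyter. A rough equivalent for a plain Python script might look like the sketch below, which uses the standard timeit module. It departs from the profile above in three ways, all of them assumptions made here: a smaller sz so that it runs quickly, integer division (sz // 2) so that the ID column stays integer, and a seeded random generator. The helper names with_merge and with_join are hypothetical, and the timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

sz = 100_000  # smaller than the one million rows used above
rng = np.random.default_rng(0)

df1 = pd.DataFrame({'ID': np.arange(sz),
                    'pID': np.arange(0, 2 * sz, 2),
                    'Values': rng.random(sz)})
df2 = pd.DataFrame({'ID': np.concatenate([np.arange(sz // 2),
                                          np.arange(sz // 2)]),
                    'pid': np.arange(0, 2 * sz, 2),
                    'Values': rng.random(sz)})


def with_merge():
    return pd.merge(df1, df2, how='left',
                    left_on=['ID', 'pID'], right_on=['ID', 'pid'])


def with_join():
    # set_index is inside the timed function, as in the profile above
    return df1.join(df2.set_index(['ID', 'pid']), how='left',
                    on=['ID', 'pID'], lsuffix='_x', rsuffix='_y')


# Best of five single runs for each approach
t_merge = min(timeit.repeat(with_merge, number=1, repeat=5))
t_join = min(timeit.repeat(with_join, number=1, repeat=5))

print(f"merge: {t_merge * 1000:.1f} ms")
print(f"join:  {t_join * 1000:.1f} ms")
print(f"join / merge: {t_join / t_merge:.2f}")
```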