Pandas join on columns with different names [duplicate]

When the names are different, use the left_on and right_on parameters instead of on=:

pd.merge(df1, df2, left_on=['userid', 'column1'], right_on=['username', 'column1'], how='left')


An alternative approach is to use join, setting the index of the right-hand DataFrame to the columns ['username', 'column1']:

df1.join(df2.set_index(['username', 'column1']), on=['userid', 'column1'], how='left')

The output of this join merges the matched keys from the two differently named key columns, userid and username, into a single column named after the key column of df1, userid; whereas the output of the merge maintains the two as separate columns. To illustrate, consider the following example:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': [1,2,3,4,5,6], 'pID' : [21,22,23,24,25,26], 'Values' : [435,33,45,np.nan,np.nan,12]})
##    ID  Values  pID
## 0   1   435.0   21
## 1   2    33.0   22
## 2   3    45.0   23
## 3   4     NaN   24
## 4   5     NaN   25
## 5   6    12.0   26

df2 = pd.DataFrame({'ID' : [4,4,5], 'pid' : [24,25,25], 'Values' : [544, 545, 676]})
##    ID  Values  pid
## 0   4     544   24
## 1   4     545   25
## 2   5     676   25

# merge keeps both key columns, pID and pid
pd.merge(df1, df2, how='left', left_on=['ID', 'pID'], right_on=['ID', 'pid'])
##    ID  Values_x  pID  Values_y   pid
## 0   1     435.0   21       NaN   NaN
## 1   2      33.0   22       NaN   NaN
## 2   3      45.0   23       NaN   NaN
## 3   4       NaN   24     544.0  24.0
## 4   5       NaN   25     676.0  25.0
## 5   6      12.0   26       NaN   NaN

# join keeps only the key columns of df1
df1.join(df2.set_index(['ID','pid']), how='left', on=['ID','pID'], lsuffix='_x', rsuffix='_y')
##    ID  Values_x  pID  Values_y
## 0   1     435.0   21       NaN
## 1   2      33.0   22       NaN
## 2   3      45.0   23       NaN
## 3   4       NaN   24     544.0
## 4   5       NaN   25     676.0
## 5   6      12.0   26       NaN

Here, we also need to specify lsuffix and rsuffix in join to distinguish the overlapping column Values in the output. As one can see, the output of merge contains the extra pid column from the right-hand DataFrame, which IMHO is unnecessary given the context of the merge. Note also that the dtype of the pid column has changed to float64, which results from upcasting due to the NaNs introduced by the unmatched rows.
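To check these two observations directly, here is a minimal sketch that rebuilds the example frames and inspects the columns and dtypes of both results. The final drop call is just one possible way to discard the redundant key column after a merge; it is added for illustration and was not part of the comparison above.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                    'pID': [21, 22, 23, 24, 25, 26],
                    'Values': [435, 33, 45, np.nan, np.nan, 12]})
df2 = pd.DataFrame({'ID': [4, 4, 5],
                    'pid': [24, 25, 25],
                    'Values': [544, 545, 676]})

merged = pd.merge(df1, df2, how='left',
                  left_on=['ID', 'pID'], right_on=['ID', 'pid'])
joined = df1.join(df2.set_index(['ID', 'pid']), how='left',
                  on=['ID', 'pID'], lsuffix='_x', rsuffix='_y')

# merge keeps the right-hand key column; join does not
print('pid' in merged.columns)
print('pid' in joined.columns)

# unmatched rows introduce NaN, so pid no longer fits an integer dtype
print(df2['pid'].dtype, '->', merged['pid'].dtype)

# dropping the redundant key leaves the same set of columns as the join result
cleaned = merged.drop(columns='pid')
print(sorted(cleaned.columns) == sorted(joined.columns))
```

If the extra column is the only objection to merge, dropping it afterwards avoids the set_index call on the right-hand DataFrame.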

This cleaner output comes at a cost in performance, as the call to set_index on the right-hand DataFrame incurs some overhead. However, a quick and dirty profile shows that the cost is not too horrible: the join is roughly 30% slower, which may be worth it:

sz = 1000000 # one million rows
df1 = pd.DataFrame({'ID': np.arange(sz), 'pID' : np.arange(0,2*sz,2), 'Values' : np.random.random(sz)})
df2 = pd.DataFrame({'ID': np.concatenate([np.arange(sz/2),np.arange(sz/2)]), 'pid' : np.arange(0,2*sz,2), 'Values' : np.random.random(sz)})

%timeit pd.merge(df1, df2, how='left', left_on=['ID', 'pID'], right_on=['ID', 'pid'])
## 818 ms ± 33.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df1.join(df2.set_index(['ID','pid']), how='left', on=['ID','pID'], lsuffix='_x', rsuffix='_y')
## 1.04 s ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
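The %timeit magic only works inside IPython or Jupyter. Outside of those, a rough equivalent can be sketched with the standard timeit module. Two things here are my assumptions rather than part of the profile above: the row count is reduced to keep the run short, and sz // 2 is used so that the ID column of df2 stays integer. The helper names with_merge and with_join are made up for this sketch. Absolute timings will differ by machine and pandas version, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

sz = 100_000  # assumption: smaller than the one million rows used above
df1 = pd.DataFrame({'ID': np.arange(sz),
                    'pID': np.arange(0, 2 * sz, 2),
                    'Values': np.random.random(sz)})
df2 = pd.DataFrame({'ID': np.concatenate([np.arange(sz // 2), np.arange(sz // 2)]),
                    'pid': np.arange(0, 2 * sz, 2),
                    'Values': np.random.random(sz)})

def with_merge():
    # hypothetical helper: the merge variant from the answer
    return pd.merge(df1, df2, how='left',
                    left_on=['ID', 'pID'], right_on=['ID', 'pid'])

def with_join():
    # hypothetical helper: the join variant, including the set_index overhead
    return df1.join(df2.set_index(['ID', 'pid']), how='left',
                    on=['ID', 'pID'], lsuffix='_x', rsuffix='_y')

# take the best of five single runs for each variant
t_merge = min(timeit.repeat(with_merge, number=1, repeat=5))
t_join = min(timeit.repeat(with_join, number=1, repeat=5))
print(f"merge: {t_merge:.3f} s, join: {t_join:.3f} s, ratio: {t_join / t_merge:.2f}")
```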
