Pandas Merging 101
This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.
In particular, here's what this post will go through:
The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
- merging with different column names
- merging with multiple columns
- avoiding duplicate merge key column in output
What this post (and other posts by me on this thread) will not go through:
- Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
- Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!
Note: Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.
Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.
Lastly, all visual representations of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.
Enough talk - just show me how to use merge!
Setup & Basics
    import numpy as np
    import pandas as pd

    np.random.seed(0)
    left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
    right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

    left

      key     value
    0   A  1.764052
    1   B  0.400157
    2   C  0.978738
    3   D  2.240893

    right

      key     value
    0   B  1.867558
    1   D -0.977278
    2   E  0.950088
    3   F -0.151357
For the sake of simplicity, the key column has the same name (for now).
An INNER JOIN is represented by

Note: This figure, along with the forthcoming ones, follows this convention:
- blue indicates rows that are present in the merge result
- red indicates rows that are excluded from the result (i.e., removed)
- green indicates missing values that are replaced with NaNs in the result
To perform an INNER JOIN, call merge on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.

    left.merge(right, on='key')
    # Or, if you want to be explicit
    # left.merge(right, on='key', how='inner')

      key   value_x   value_y
    0   B  0.400157  1.867558
    1   D  2.240893 -0.977278

This returns only rows from left and right which share a common key (in this example, "B" and "D").
A LEFT OUTER JOIN, or LEFT JOIN is represented by

This can be performed by specifying how='left'.

    left.merge(right, on='key', how='left')

      key   value_x   value_y
    0   A  1.764052       NaN
    1   B  0.400157  1.867558
    2   C  0.978738       NaN
    3   D  2.240893 -0.977278

Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.
And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...

...specify how='right':

    left.merge(right, on='key', how='right')

      key   value_x   value_y
    0   B  0.400157  1.867558
    1   D  2.240893 -0.977278
    2   E       NaN  0.950088
    3   F       NaN -0.151357

Here, keys from right are used, and missing data from left is replaced by NaN.
Finally, for the FULL OUTER JOIN, given by

specify how='outer'.

    left.merge(right, on='key', how='outer')

      key   value_x   value_y
    0   A  1.764052       NaN
    1   B  0.400157  1.867558
    2   C  0.978738       NaN
    3   D  2.240893 -0.977278
    4   E       NaN  0.950088
    5   F       NaN -0.151357
This uses the keys from both frames, and NaNs are inserted for missing rows in both.
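To compare the four how options side by side, here is a small sketch that rebuilds the left and right frames from above and collects the keys each join type keeps. The name keys_by_how is only illustrative and not part of pandas.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

# Collect the keys that survive each join type (illustrative dict).
keys_by_how = {
    how: left.merge(right, on='key', how=how)['key'].tolist()
    for how in ['inner', 'left', 'right', 'outer']
}

for how, keys in keys_by_how.items():
    print(f'{how:>5}: {keys}')
```

The inner result holds only the shared keys, left and right each hold the keys of one frame, and outer holds the union.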
The documentation summarizes these various merges nicely:
Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs
If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs, you can do them in two steps.
For LEFT-Excluding JOIN, represented as

Start by performing a LEFT OUTER JOIN, then filter to keep only the rows coming from left alone (excluding everything that also matched right),

    (left.merge(right, on='key', how='left', indicator=True)
         .query('_merge == "left_only"')
         .drop(columns='_merge'))

      key   value_x  value_y
    0   A  1.764052      NaN
    2   C  0.978738      NaN

Where,

    left.merge(right, on='key', how='left', indicator=True)

      key   value_x   value_y     _merge
    0   A  1.764052       NaN  left_only
    1   B  0.400157  1.867558       both
    2   C  0.978738       NaN  left_only
    3   D  2.240893 -0.977278       both
And similarly, for a RIGHT-Excluding JOIN,

    (left.merge(right, on='key', how='right', indicator=True)
         .query('_merge == "right_only"')
         .drop(columns='_merge'))

      key  value_x   value_y
    2   E      NaN  0.950088
    3   F      NaN -0.151357
Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),

You can do this in similar fashion:

    (left.merge(right, on='key', how='outer', indicator=True)
         .query('_merge != "both"')
         .drop(columns='_merge'))

      key   value_x   value_y
    0   A  1.764052       NaN
    2   C  0.978738       NaN
    4   E       NaN  0.950088
    5   F       NaN -0.151357
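The three excluding joins share the same pattern, so they can be wrapped in one function. The sketch below is only an illustration: excluding_join and its side argument are hypothetical names, not a pandas API, and it always performs an OUTER JOIN before filtering, rather than the LEFT or RIGHT join used above.

```python
import pandas as pd

def excluding_join(left, right, on, side):
    """Hypothetical helper: keep rows whose key appears on one side only.

    side is 'left_only', 'right_only', or 'either' (keys in exactly one frame).
    """
    merged = left.merge(right, on=on, how='outer', indicator=True)
    if side == 'either':
        mask = merged['_merge'] != 'both'
    else:
        mask = merged['_merge'] == side
    return merged[mask].drop(columns='_merge')

# Small stand-in frames with the same keys as the examples above.
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5, 6, 7, 8]})

left_excl = excluding_join(left, right, on='key', side='left_only')
anti = excluding_join(left, right, on='key', side='either')
print(left_excl)
print(anti)
```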
Different names for key columns
If the key columns are named differently (for example, left has keyLeft, and right has keyRight instead of key), then you will have to specify left_on and right_on as arguments instead of on:

    left2 = left.rename({'key': 'keyLeft'}, axis=1)
    right2 = right.rename({'key': 'keyRight'}, axis=1)

    left2

      keyLeft     value
    0       A  1.764052
    1       B  0.400157
    2       C  0.978738
    3       D  2.240893

    right2

      keyRight     value
    0        B  1.867558
    1        D -0.977278
    2        E  0.950088
    3        F -0.151357

    left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')

      keyLeft   value_x keyRight   value_y
    0       B  0.400157        B  1.867558
    1       D  2.240893        D -0.977278
Avoiding duplicate key column in output
When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.

    left3 = left2.set_index('keyLeft')
    left3.merge(right2, left_index=True, right_on='keyRight')

        value_x keyRight   value_y
    0  0.400157        B  1.867558
    1  2.240893        D -0.977278

Contrast this with the output of the command just before (that is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')), and you'll notice keyLeft is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
Merging only a single column from one of the DataFrames
For example, consider
    right3 = right.assign(newcol=np.arange(len(right)))
    right3

      key     value  newcol
    0   B  1.867558       0
    1   D -0.977278       1
    2   E  0.950088       2
    3   F -0.151357       3

If you are required to merge only "newcol" (without any of the other columns), you can usually just subset columns before merging:

    left.merge(right3[['key', 'newcol']], on='key')

      key     value  newcol
    0   B  0.400157       0
    1   D  2.240893       1

If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:

    # left['newcol'] = left['key'].map(right3.set_index('key')['newcol'])
    left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))

      key     value  newcol
    0   A  1.764052     NaN
    1   B  0.400157     0.0
    2   C  0.978738     NaN
    3   D  2.240893     1.0

As mentioned, this is similar to, but faster than

    left.merge(right3[['key', 'newcol']], on='key', how='left')

      key     value  newcol
    0   A  1.764052     NaN
    1   B  0.400157     0.0
    2   C  0.978738     NaN
    3   D  2.240893     1.0
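As a quick sanity check, the sketch below rebuilds the frames and confirms that the map route and the LEFT JOIN route give the same frame. It assumes the keys in right3 are unique, since map looks up a single value per key; the name lookup is only illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
right3 = right.assign(newcol=np.arange(len(right)))

# Series indexed by key; assumed to have no duplicate keys.
lookup = right3.set_index('key')['newcol']

via_map = left.assign(newcol=left['key'].map(lookup))
via_merge = left.merge(right3[['key', 'newcol']], on='key', how='left')

# Raises an AssertionError if the two results differ.
pd.testing.assert_frame_equal(via_map, via_merge)
```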
Merging on multiple columns
To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).
left.merge(right, on=['key1', 'key2'] ...)
Or, in the event the names are different,
left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
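To make the multi-column case concrete, here is a minimal sketch with made-up frames. The names left_mc, right_mc, lval, and rval are hypothetical; only key1 and key2 mirror the snippet above.

```python
import pandas as pd

# Hypothetical frames for illustration only.
left_mc = pd.DataFrame({
    'key1': ['A', 'A', 'B'],
    'key2': [1, 2, 1],
    'lval': [10, 20, 30],
})
right_mc = pd.DataFrame({
    'key1': ['A', 'B', 'B'],
    'key2': [1, 1, 2],
    'rval': [100, 200, 300],
})

# A row matches only when BOTH key1 and key2 agree.
result = left_mc.merge(right_mc, on=['key1', 'key2'])
print(result)
```

Only the pairs ('A', 1) and ('B', 1) appear in both frames, so the default INNER JOIN keeps two rows.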
Other useful merge* operations and functions
- Merging a DataFrame with Series on index: See this answer.
- Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.
- pd.merge_ordered is a useful function for ordered JOINs.
- pd.merge_asof (read: merge_asOf) is useful for approximate joins.
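As a taste of what an approximate join looks like, here is a minimal pd.merge_asof sketch. The quotes and trades frames are made up for illustration, and both are sorted by the key, which merge_asof requires.

```python
import pandas as pd

# Hypothetical data; both frames must be sorted by the 'time' key.
quotes = pd.DataFrame({'time': [1, 5, 10], 'price': [100.0, 101.0, 102.0]})
trades = pd.DataFrame({'time': [2, 6, 11], 'qty': [10, 20, 30]})

# For each trade, take the most recent quote at or before the trade time
# (the default direction is 'backward').
matched = pd.merge_asof(trades, quotes, on='time')
print(matched)
```

None of the trade times equals a quote time exactly, yet every trade still picks up a price from the nearest earlier quote.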
This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specifications.
A supplemental visual view of pd.concat([df0, df1], kwargs). Notice that the meaning of the kwarg axis=0 or axis=1 is not as intuitive as it is for df.mean() or df.apply(func).
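A tiny sketch can make the axis argument less abstract. The frames df0 and df1 below are made-up stand-ins with the same column names.

```python
import pandas as pd

df0 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df1 = pd.DataFrame({'a': [5, 6], 'b': [7, 8]})

stacked = pd.concat([df0, df1], axis=0)       # rows of df1 go below df0
side_by_side = pd.concat([df0, df1], axis=1)  # columns of df1 go beside df0

print(stacked.shape)
print(side_by_side.shape)
```

With axis=0 the row count grows and the columns stay the same; with axis=1 the column count grows and rows are aligned on the index.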
In this answer, I will consider two practical examples.
The first is pandas.concat.
The second is merging dataframes on the index of one and a column of the other.
Consider the following DataFrames, which have the same column names:
- Preco2018 with size (8784, 5)
- Preco2019 with size (8760, 5)
You can combine them using pandas.concat, simply with:

    import pandas as pd

    frames = [Preco2018, Preco2019]
    df_merged = pd.concat(frames)
Which results in a DataFrame with the following size (17544, 5)
If you want to visualize, it ends up working like this
(Source)
2. Merge by Column and Index
In this part, I will consider a specific case: If one wants to merge the index of one dataframe and the column of another dataframe.
Let's say one has the dataframe Geo with 54 columns, one of which is the date column Data, of type datetime64[ns].
And the dataframe Price has one column with the price, and its index corresponds to the dates.
In this specific case, to merge them, one uses pd.merge:
merged = pd.merge(Price, Geo, left_index=True, right_on='Data')
Which results in the following dataframe
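Since the original frames are not available here, the following sketch uses small stand-ins to show the same call. The values, the price and region columns, and the three-row size are assumptions for illustration; only the names Price, Geo, and Data come from the text above.

```python
import pandas as pd

dates = pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-03'])

# Stand-in for Price: one price column, indexed by date.
Price = pd.DataFrame({'price': [10.0, 11.0, 12.0]}, index=dates)

# Stand-in for Geo: a datetime column called Data plus one other column.
Geo = pd.DataFrame({
    'Data': pd.to_datetime(['2019-01-02', '2019-01-03', '2019-01-04']),
    'region': ['north', 'south', 'north'],
})

# Match the index of Price against the Data column of Geo (INNER JOIN by default).
merged = pd.merge(Price, Geo, left_index=True, right_on='Data')
print(merged)
```

Only the dates present in both the Price index and the Data column survive, so this stand-in yields two rows.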