Difference(s) between merge() and concat() in pandas
A very high level difference is that merge()
is used to combine two (or more) dataframes on the basis of values of common columns (indices can also be used, use left_index=True
and/or right_index=True
), and concat()
is used to append one (or more) dataframes one below the other (or sideways, depending on whether the axis
option is set to 0 or 1).
join()
is used to merge 2 dataframes on the basis of the index; instead of using merge()
with the option left_index=True
we can use join()
.
For example:
df1 = pd.DataFrame({'Key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})df1: Key data10 b 01 b 12 a 23 c 34 a 45 a 56 b 6df2 = pd.DataFrame({'Key': ['a', 'b', 'd'], 'data2': range(3)})df2: Key data20 a 01 b 12 d 2#Merge# The 2 dataframes are merged on the basis of values in column "Key" as it is # a common column in 2 dataframespd.merge(df1, df2) Key data1 data20 b 0 11 b 1 12 b 6 13 a 2 04 a 4 05 a 5 0#Concat# df2 dataframe is appended at the bottom of df1 pd.concat([df1, df2]) Key data1 data20 b 0 NaN1 b 1 NaN2 a 2 NaN3 c 3 NaN4 a 4 NaN5 a 5 NaN6 b 6 NaN0 a Nan 01 b Nan 12 d Nan 2
At a high level:
.concat()
simply stacks multipleDataFrame
together eithervertically, or stitches horizontally after aligning on index.merge()
first aligns twoDataFrame
' selected common column(s) orindex, and then pick up the remaining columns from the aligned rows of eachDataFrame
.
More specifically, .concat()
:
- Is a top-level pandas function
- Combines two or more pandas
DataFrame
vertically or horizontally - Aligns only on the index when combining horizontally
- Errors when any of the
DataFrame
contains a duplicate index. - Defaults to outer join with the option for inner join
And .merge()
:
- Exists both as a top-level pandas function and a
DataFrame
method (as of pandas 1.0) - Combines exactly two
DataFrame
horizontally - Aligns the calling
DataFrame
's column(s) or index with the otherDataFrame
's column(s) or index - Handles duplicate values on the joining columns or index byperforming a cartesian product
- Defaults to inner join with options for left, outer, and right
Note that when performing pd.merge(left, right)
, if left
has two rows containing the same values from the joining columns or index, each row will combine with right
's corresponding row(s) resulting in a cartesian product. On the other hand, if .concat()
is used to combine columns, we need to make sure no duplicated index exists in either DataFrame
.
Practically speaking:
- Consider
.concat()
first when combining homogeneousDataFrame
, whileconsider.merge()
first when combining complementaryDataFrame
. - If need to merge vertically, go with
.concat()
. If need to mergehorizontally via columns, go with.merge()
, which by default merge on the columns in common.
Reference: Pandas 1.x Cookbook
pd.concat
takes an Iterable
as its argument. Hence, it cannot take DataFrame
s directly as its argument. Also Dimension
s of the DataFrame
should match along axis while concatenating.
pd.merge
can take DataFrame
s as its argument, and is used to combine two DataFrame
s with same columns or index, which can't be done with pd.concat
since it will show the repeated column in the DataFrame.
Whereas join can be used to join two DataFrame
s with different indices.