Difference(s) between merge() and concat() in pandas Difference(s) between merge() and concat() in pandas python python

Difference(s) between merge() and concat() in pandas


A very high level difference is that merge() is used to combine two (or more) dataframes on the basis of values of common columns (indices can also be used, use left_index=True and/or right_index=True), and concat() is used to append one (or more) dataframes one below the other (or sideways, depending on whether the axis option is set to 0 or 1).

join() is used to merge 2 dataframes on the basis of the index; instead of using merge() with the option left_index=True we can use join().

For example:

df1 = pd.DataFrame({'Key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})df1:   Key  data10   b   01   b   12   a   23   c   34   a   45   a   56   b   6df2 = pd.DataFrame({'Key': ['a', 'b', 'd'], 'data2': range(3)})df2:    Key data20   a   01   b   12   d   2#Merge# The 2 dataframes are merged on the basis of values in column "Key" as it is # a common column in 2 dataframespd.merge(df1, df2)   Key data1 data20   b    0    11   b    1    12   b    6    13   a    2    04   a    4    05   a    5    0#Concat# df2 dataframe is appended at the bottom of df1 pd.concat([df1, df2])   Key data1 data20   b   0     NaN1   b   1     NaN2   a   2     NaN3   c   3     NaN4   a   4     NaN5   a   5     NaN6   b   6     NaN0   a   Nan   01   b   Nan   12   d   Nan   2


At a high level:

  • .concat() simply stacks multiple DataFrame together eithervertically, or stitches horizontally after aligning on index
  • .merge() first aligns two DataFrame' selected common column(s) orindex, and then pick up the remaining columns from the aligned rows of each DataFrame.

More specifically, .concat():

  • Is a top-level pandas function
  • Combines two or more pandas DataFrame vertically or horizontally
  • Aligns only on the index when combining horizontally
  • Errors when any of the DataFrame contains a duplicate index.
  • Defaults to outer join with the option for inner join

And .merge():

  • Exists both as a top-level pandas function and a DataFrame method (as of pandas 1.0)
  • Combines exactly two DataFrame horizontally
  • Aligns the calling DataFrame's column(s) or index with the otherDataFrame's column(s) or index
  • Handles duplicate values on the joining columns or index byperforming a cartesian product
  • Defaults to inner join with options for left, outer, and right

Note that when performing pd.merge(left, right), if left has two rows containing the same values from the joining columns or index, each row will combine with right's corresponding row(s) resulting in a cartesian product. On the other hand, if .concat() is used to combine columns, we need to make sure no duplicated index exists in either DataFrame.

Practically speaking:

  • Consider .concat() first when combining homogeneous DataFrame, whileconsider .merge() first when combining complementary DataFrame.
  • If need to merge vertically, go with .concat(). If need to mergehorizontally via columns, go with .merge(), which by default merge on the columns in common.

Reference: Pandas 1.x Cookbook


pd.concat takes an Iterable as its argument. Hence, it cannot take DataFrames directly as its argument. Also Dimensions of the DataFrame should match along axis while concatenating.

pd.merge can take DataFrames as its argument, and is used to combine two DataFrames with same columns or index, which can't be done with pd.concat since it will show the repeated column in the DataFrame.

Whereas join can be used to join two DataFrames with different indices.