
Improve Pandas Merge performance


set_index on the merge column does indeed speed this up. Below is a slightly more realistic version of julien-marrec's answer.

import pandas as pd
import numpy as np

myids = np.random.choice(np.arange(10000000), size=1000000, replace=False)
df1 = pd.DataFrame(myids, columns=['A'])
df1['B'] = np.random.randint(0, 1000, (1000000))
df2 = pd.DataFrame(np.random.permutation(myids), columns=['A2'])
df2['B2'] = np.random.randint(0, 1000, (1000000))

%%timeit
x = df1.merge(df2, how='left', left_on='A', right_on='A2')
# 1 loop, best of 3: 664 ms per loop

%%timeit
x = df1.set_index('A').join(df2.set_index('A2'), how='left')
# 1 loop, best of 3: 354 ms per loop

%%time
df1.set_index('A', inplace=True)
df2.set_index('A2', inplace=True)
# Wall time: 16 ms

%%timeit
x = df1.join(df2, how='left')
# 10 loops, best of 3: 80.4 ms per loop

Even when the join column contains integers that are not in the same order in both tables, you can still expect a speed-up of about 8x.
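One practical difference worth noting (my addition, not part of the original answer): after an index join the key lives in the index, not in a column. A minimal sketch, continuing from the indexed df1 and df2 above, if downstream code expects the merge-style output:

# The index join keeps the key 'A' in the index; reset_index() turns it
# back into a regular column. (merge with left_on/right_on would also
# keep a separate 'A2' column, which the index join drops.)
x = df1.join(df2, how='left').reset_index()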


I suggest that you set your merge columns as the index and use df1.join(df2) instead of merge; it's much faster.

Here's an example, including profiling:

In [1]: import pandas as pd
        import numpy as np

        df1 = pd.DataFrame(np.arange(1000000), columns=['A'])
        df1['B'] = np.random.randint(0, 1000, (1000000))

        df2 = pd.DataFrame(np.arange(1000000), columns=['A2'])
        df2['B2'] = np.random.randint(0, 1000, (1000000))

Here's a regular left merge on A and A2:

In [2]: %%timeit
        x = df1.merge(df2, how='left', left_on='A', right_on='A2')
1 loop, best of 3: 441 ms per loop

Here's the same, using join:

In [3]: %%timeit
        x = df1.set_index('A').join(df2.set_index('A2'), how='left')
1 loop, best of 3: 184 ms per loop
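(A side note of mine, not from the original answer: merge can also consume the indexes directly via left_index/right_index, which is equivalent to the join call above.)

x = df1.set_index('A').merge(df2.set_index('A2'), how='left',
                             left_index=True, right_index=True)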

Now, obviously, if you can set the index once before looping, the time savings will be much greater:

# Do this before looping
In [4]: %%time
        df1.set_index('A', inplace=True)
        df2.set_index('A2', inplace=True)
CPU times: user 9.78 ms, sys: 9.31 ms, total: 19.1 ms
Wall time: 16.8 ms

Then, in the loop, you'll get a join that in this case is about 30 times faster than the original merge:

In [5]: %%timeit
        x = df1.join(df2, how='left')
100 loops, best of 3: 14.3 ms per loop
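To make the index-before-the-loop pattern explicit, here is a minimal self-contained sketch of my own; the 100-iteration loop is invented purely for illustration:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(1000000), columns=['A'])
df1['B'] = np.random.randint(0, 1000, (1000000))
df2 = pd.DataFrame(np.arange(1000000), columns=['A2'])
df2['B2'] = np.random.randint(0, 1000, (1000000))

# Pay the indexing cost once, outside the loop.
df1.set_index('A', inplace=True)
df2.set_index('A2', inplace=True)

# Inside the loop, only the cheap index-aligned join remains.
for _ in range(100):
    x = df1.join(df2, how='left')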


I don't know if this deserves a new answer, but personally, the following tricks helped me further improve the joins I had to do on big DataFrames (millions of rows and hundreds of columns):

  1. Besides using set_index(index, inplace=True), you may want to sort the index with sort_index(inplace=True). This speeds up the join a lot if your index is not ordered. For example, creating the DataFrames with
import random
import pandas as pd
import numpy as np

nbre_items = 100000

ids = np.arange(nbre_items)
random.shuffle(ids)

df1 = pd.DataFrame({"id": ids})
df1['value'] = 1
df1.set_index("id", inplace=True)

random.shuffle(ids)

df2 = pd.DataFrame({"id": ids})
df2['value2'] = 2
df2.set_index("id", inplace=True)

I got the following results:

%timeit df1.join(df2)
13.2 ms ± 349 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And after sorting the index (which takes a limited amount of time):

df1.sort_index(inplace=True)
df2.sort_index(inplace=True)

%timeit df1.join(df2)
764 µs ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  2. You can split one of your DataFrames into multiple ones with fewer columns. This trick gave me mixed results, so be cautious when using it. For example (see also the wrapped-up sketch below):
for i in range(0, df2.shape[1], 100):
    df1 = df1.join(df2.iloc[:, i:min(df2.shape[1], (i + 100))], how='outer')
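For readability, the same chunked-join idea can be wrapped in a small helper; join_in_chunks is a hypothetical name of mine, not from the original answer:

def join_in_chunks(left, right, chunk_size=100, how='outer'):
    # Hypothetical helper: join `right` onto `left` one block of
    # columns at a time, so each intermediate result stays narrower.
    # .iloc slicing clamps at the last column, so no min() is needed.
    for i in range(0, right.shape[1], chunk_size):
        left = left.join(right.iloc[:, i:i + chunk_size], how=how)
    return left

# Usage, continuing from the df1/df2 defined above:
df1 = join_in_chunks(df1, df2)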