How to merge two pandas DataFrames in parallel (multithreading or multiprocessing)


I believe you can use dask and its merge function.

Docs say:

What definitely works?

Cleverly parallelizable operations (also fast):

Join on index: dd.merge(df1, df2, left_index=True, right_index=True)

Or:

Operations requiring a shuffle (slow-ish, unless on index)

Set index: df.set_index(df.x)

Join not on the index: dd.merge(df1, df2, on='name')

You can also check how to create Dask DataFrames.

Example

    import pandas as pd

    left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                         'A': ['A0', 'A1', 'A2', 'A3'],
                         'B': ['B0', 'B1', 'B2', 'B3']})

    right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})

    result = pd.merge(left, right, on='key')
    print(result)

        A   B key   C   D
    0  A0  B0  K0  C0  D0
    1  A1  B1  K1  C1  D1
    2  A2  B2  K2  C2  D2
    3  A3  B3  K3  C3  D3

    import dask.dataframe as dd

    # Construct dask objects from the pandas objects
    left1 = dd.from_pandas(left, npartitions=3)
    right1 = dd.from_pandas(right, npartitions=3)

    # Merge on key (row order is not guaranteed to match pandas)
    print(dd.merge(left1, right1, on='key').compute())

        A   B key   C   D
    0  A3  B3  K3  C3  D3
    1  A1  B1  K1  C1  D1
    0  A2  B2  K2  C2  D2
    1  A0  B0  K0  C0  D0

    # First set the indexes, then merge on them; call compute() once, on the merged result
    print(dd.merge(left1.set_index('key'),
                   right1.set_index('key'),
                   left_index=True,
                   right_index=True).compute())

          A   B   C   D
    key
    K0   A0  B0  C0  D0
    K1   A1  B1  C1  D1
    K2   A2  B2  C2  D2
    K3   A3  B3  C3  D3


You can improve the speed (by a factor of about 3 on the given example) of your merge by making the key column the index of your dataframes and using join instead.

    left2 = left.set_index('key')
    right2 = right.set_index('key')

    In [46]: %timeit result2 = left2.join(right2)
    1000 loops, best of 3: 361 µs per loop

    In [47]: %timeit result = pd.merge(left, right, on='key')
    1000 loops, best of 3: 1.01 ms per loop
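If you are not working in IPython, the same comparison can be reproduced with the standard timeit module. This is a rough sketch under the assumption that `left` and `right` are the small example frames from the first answer; the repetition count `n` is an arbitrary choice, and the measured ratio will depend on your machine and pandas version, so do not expect exactly the factor quoted above.

```python
import timeit

import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# Index both frames by the key once, up front
left2 = left.set_index('key')
right2 = right.set_index('key')

joined = left2.join(right2)
merged = pd.merge(left, right, on='key')

# Both approaches should produce the same rows; align column order before comparing
same_rows = joined.reset_index()[merged.columns].equals(merged)

n = 200  # arbitrary number of repetitions
t_join = timeit.timeit(lambda: left2.join(right2), number=n) / n
t_merge = timeit.timeit(lambda: pd.merge(left, right, on='key'), number=n) / n

print("same rows:", same_rows)
print("join:  {:.1f} µs per call".format(t_join * 1e6))
print("merge: {:.1f} µs per call".format(t_merge * 1e6))
```

Note that the cost of set_index is not included in the timing, so the join approach pays off mainly when the indexed frames are reused for several joins.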