How to merge two pandas dataframes in parallel (multithreading or multiprocessing)
I believe you can use dask and its `merge` function.
Docs say:
> What definitely works?
>
> Cleverly parallelizable operations (also fast):
>
> - Join on index: `dd.merge(df1, df2, left_index=True, right_index=True)`
>
> Operations requiring a shuffle (slow-ish, unless on index):
>
> - Set index: `df.set_index(df.x)`
> - Join not on the index: `dd.merge(df1, df2, on='name')`
You can also check the docs on how to Create Dask DataFrames.
Example
```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

result = pd.merge(left, right, on='key')
print(result)
#     A   B key   C   D
# 0  A0  B0  K0  C0  D0
# 1  A1  B1  K1  C1  D1
# 2  A2  B2  K2  C2  D2
# 3  A3  B3  K3  C3  D3

import dask.dataframe as dd

# construct dask objects from the pandas objects
left1 = dd.from_pandas(left, npartitions=3)
right1 = dd.from_pandas(right, npartitions=3)

# merge on key
print(dd.merge(left1, right1, on='key').compute())
#     A   B key   C   D
# 0  A3  B3  K3  C3  D3
# 1  A1  B1  K1  C1  D1
# 0  A2  B2  K2  C2  D2
# 1  A0  B0  K0  C0  D0
```
```python
# first set the indexes, then merge on them
print(dd.merge(left1.set_index('key').compute(),
               right1.set_index('key').compute(),
               left_index=True, right_index=True))
#       A   B   C   D
# key
# K0   A0  B0  C0  D0
# K1   A1  B1  C1  D1
# K2   A2  B2  C2  D2
# K3   A3  B3  C3  D3
```
You can improve the speed of your merge (by a factor of about 3 on the given example) by making the key column the index of your dataframes and using `join` instead.
```python
left2 = left.set_index('key')
right2 = right.set_index('key')

In [46]: %timeit result2 = left2.join(right2)
1000 loops, best of 3: 361 µs per loop

In [47]: %timeit result = pd.merge(left, right, on='key')
1000 loops, best of 3: 1.01 ms per loop
```