How to merge two pandas dataframes in parallel (multithreading or multiprocessing)
I believe you can use dask and its `merge` function.
Docs say:
> What definitely works?
>
> Cleverly parallelizable operations (also fast):
>
> - Join on index: `dd.merge(df1, df2, left_index=True, right_index=True)`
>
> Operations requiring a shuffle (slow-ish, unless on index):
>
> - Set index: `df.set_index(df.x)`
> - Join not on the index: `dd.merge(df1, df2, on='name')`
You can also check the docs on how to Create Dask DataFrames.
Example
```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

result = pd.merge(left, right, on='key')
print(result)
#     A   B key   C   D
# 0  A0  B0  K0  C0  D0
# 1  A1  B1  K1  C1  D1
# 2  A2  B2  K2  C2  D2
# 3  A3  B3  K3  C3  D3

import dask.dataframe as dd

# construct dask objects from the pandas objects
left1 = dd.from_pandas(left, npartitions=3)
right1 = dd.from_pandas(right, npartitions=3)

# merge on key
print(dd.merge(left1, right1, on='key').compute())
#     A   B key   C   D
# 0  A3  B3  K3  C3  D3
# 1  A1  B1  K1  C1  D1
# 0  A2  B2  K2  C2  D2
# 1  A0  B0  K0  C0  D0
```
```python
# first set the indexes, then merge on them
print(dd.merge(left1.set_index('key').compute(),
               right1.set_index('key').compute(),
               left_index=True, right_index=True))
#       A   B   C   D
# key
# K0   A0  B0  C0  D0
# K1   A1  B1  C1  D1
# K2   A2  B2  C2  D2
# K3   A3  B3  C3  D3
```
You can improve the speed of your merge (by a factor of about 3 on the given example) by making the key column the index of your dataframes and using `join` instead.
```python
left2 = left.set_index('key')
right2 = right.set_index('key')

In [46]: %timeit result2 = left2.join(right2)
1000 loops, best of 3: 361 µs per loop

In [47]: %timeit result = pd.merge(left, right, on='key')
1000 loops, best of 3: 1.01 ms per loop
```