
How to parallelize groupby() in dask?


By default, Dask uses a multi-threaded scheduler, which runs every task in a single process on your machine; because of Python's GIL, this gives limited speed-up for pandas-heavy work. (Note that using Dask is nevertheless interesting in this mode if you have data that can't fit in memory.)
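For instance, you can make that default explicit by passing scheduler="threads" to compute() (a minimal sketch reusing the data.csv file and name column from the example below):

    import dask.dataframe as dd

    df = dd.read_csv("data.csv")
    # Equivalent to the default: tasks run on a thread pool inside this single process.
    res = df.groupby("name").agg("count").compute(scheduler="threads")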

If you want to spread the computation across several processes (and thus several cores), you have to use a different scheduler:

    import time

    from dask import dataframe as dd
    from dask.distributed import LocalCluster, Client

    df = dd.read_csv("data.csv")

    def group(num_workers):
        start = time.time()
        res = df.groupby("name").agg("count").compute(num_workers=num_workers)
        end = time.time()
        return res, end - start

    # First run: the default multi-threaded scheduler.
    print(group(4))

    # Start a local cluster of worker processes and register it
    # as the default scheduler for subsequent computations.
    clust = LocalCluster()
    clt = Client(clust, set_as_default=True)

    # Second run: the same computation, now on the local cluster.
    print(group(4))
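If you don't need a full cluster, Dask also lets you pick a process-based scheduler per call; a minimal sketch (it sidesteps the GIL at the cost of moving data between processes):

    # Lighter-weight alternative: the multiprocessing scheduler, no cluster needed.
    res = df.groupby("name").agg("count").compute(scheduler="processes")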

Here, I create a local cluster of 4 worker processes (because I have a quad-core CPU) and then register a default scheduling client that will run subsequent Dask operations on that cluster. With a 1.5 GB, two-column CSV file, the standard groupby takes around 35 seconds on my laptop, whereas the multiprocess one takes only around 22 seconds.
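If you want to size the cluster yourself instead of relying on LocalCluster's defaults (one worker per core), you can pass n_workers and threads_per_worker; a minimal sketch matching the quad-core setup above:

    from dask.distributed import LocalCluster, Client

    # Explicitly request 4 single-threaded worker processes.
    clust = LocalCluster(n_workers=4, threads_per_worker=1)
    clt = Client(clust, set_as_default=True)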