How to parallelize many (fuzzy) string comparisons using apply in Pandas?

python pandas parallel-processing dask fuzzywuzzy

You can parallelize this with dask.dataframe.

>>> import dask.dataframe as dd
>>> dmaster = dd.from_pandas(master, npartitions=4)
>>> dmaster['my_value'] = dmaster.original.apply(lambda x: helper(x, slave), name='my_value')
>>> dmaster.compute()
                  original  my_value
0  this is a nice sentence         2
1      this is another one         3
2    stackoverflow is nice         1

Additionally, you should think about the tradeoffs between using threads vs processes here. Your fuzzy string matching almost certainly doesn't release the GIL, so you won't get any benefit from using multiple threads. However, using processes will cause data to serialize and move around your machine, which might slow things down a bit.

You can experiment with threads, processes, or a distributed system by passing the get= keyword argument to the compute() method.

>>> import dask.multiprocessing
>>> import dask.threaded
>>> dmaster.compute(get=dask.threaded.get)  # this is default for dask.dataframe
>>> dmaster.compute(get=dask.multiprocessing.get)  # try processes instead
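The thread-versus-process trade-off described above can also be tried without Dask. The following is only a rough sketch using the standard library: `similarity`, `best_value`, and `match_all` are hypothetical names, and `difflib.SequenceMatcher` is a stand-in for the fuzzywuzzy scorer, not the scorer from the question.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from functools import partial


def similarity(a, b):
    # Stand-in scorer (assumption): pure Python, so it holds the GIL while it runs.
    return SequenceMatcher(None, a, b).ratio()


def best_value(query, names, values):
    """Return the value paired with the name most similar to `query`."""
    scores = [similarity(query, name) for name in names]
    return values[scores.index(max(scores))]


def match_all(queries, names, values, executor_cls=ThreadPoolExecutor, workers=4):
    """Score every query in a worker pool; swap executor_cls to compare pool types."""
    task = partial(best_value, names=names, values=values)
    with executor_cls(max_workers=workers) as pool:
        # With a process pool, `task` (including names and values) is pickled
        # and sent to the workers: the serialization cost mentioned above.
        return list(pool.map(task, queries))


names = ['hello world', 'congratulations', 'this is a nice sentence ',
         'this is another one', 'stackoverflow is nice']
values = [1, 2, 3, 4, 5]
queries = ['this is a nice sentence', 'this is another one', 'stackoverflow is nice']

results = match_all(queries, names, values)
```

To try processes instead, pass `executor_cls=ProcessPoolExecutor` (also from `concurrent.futures`) and run the call under an `if __name__ == "__main__":` guard. Timing both variants on your own data is the only reliable way to see which one wins.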

I'm working on something similar and wanted to provide a more complete working solution for anyone else who might stumble upon this question. The code snippets in @MRocklin's answer unfortunately had some syntax errors as posted. I am no expert with Dask, so I can't comment on performance considerations, but this should accomplish your task just as @MRocklin suggested. This uses Dask version 0.17.2 and Pandas version 0.22.0:

import dask.dataframe as dd
import dask.multiprocessing
import dask.threaded
from fuzzywuzzy import fuzz
import pandas as pd

master = pd.DataFrame({'original': ['this is a nice sentence',
                                    'this is another one',
                                    'stackoverflow is nice']})
slave = pd.DataFrame({'name': ['hello world',
                               'congratulations',
                               'this is a nice sentence ',
                               'this is another one',
                               'stackoverflow is nice'],
                      'my_value': [1, 2, 3, 4, 5]})

def fuzzy_score(str1, str2):
    return fuzz.token_set_ratio(str1, str2)

def helper(orig_string, slave_df):
    slave_df['score'] = slave_df.name.apply(lambda x: fuzzy_score(x, orig_string))
    # return my_value corresponding to the highest score
    return slave_df.loc[slave_df.score.idxmax(), 'my_value']

dmaster = dd.from_pandas(master, npartitions=4)
dmaster['my_value'] = dmaster.original.apply(lambda x: helper(x, slave), meta=('x', 'f8'))

Then, obtain your results (like in this interpreter session):

In [6]: dmaster.compute(get=dask.multiprocessing.get)
Out[6]:
                  original  my_value
0  this is a nice sentence         3
1      this is another one         4
2    stackoverflow is nice         5

These answers are based on an older API. Some newer code:

dmaster = dd.from_pandas(master, npartitions=4)
dmaster['my_value'] = dmaster.original.apply(lambda x: helper(x, slave), meta=('x', 'f8'))
dmaster.compute(scheduler='processes')

Personally I'd ditch that apply call to fuzzy_score in the helper function and just perform the operation there.

You can alter the scheduler using these tips.
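One possible reading of the suggestion above to drop the nested apply is sketched below. It is an illustration, not the answerer's code: the `names`/`values` arguments and `default_scorer` are assumptions made for this sketch, and `difflib.SequenceMatcher` stands in for `fuzz.token_set_ratio` so the snippet runs with the standard library only.

```python
from difflib import SequenceMatcher


def default_scorer(str1, str2):
    # Stand-in for fuzz.token_set_ratio (assumption), scaled to the 0..100 range.
    return 100 * SequenceMatcher(None, str1, str2).ratio()


def helper(orig_string, names, values, scorer=default_scorer):
    """Return the value whose name scores highest against orig_string."""
    # Score inline in a single pass: no nested apply, and no 'score' column
    # written back onto the shared lookup table.
    best_name, best_value = max(
        zip(names, values), key=lambda pair: scorer(pair[0], orig_string)
    )
    return best_value


names = ['hello world', 'congratulations', 'this is a nice sentence ',
         'this is another one', 'stackoverflow is nice']
values = [1, 2, 3, 4, 5]

matched = [helper(s, names, values)
           for s in ['this is a nice sentence',
                     'this is another one',
                     'stackoverflow is nice']]
```

With the DataFrames from the earlier answer, this could be called as `helper(x, slave['name'].tolist(), slave['my_value'].tolist(), scorer=fuzz.token_set_ratio)`. Because it no longer mutates the lookup table, each call is independent of the others.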
