Apply function to grouped data frame in Dask: How do you specify the grouped Dataframe as argument in the function?
The function you provide to groupby-apply should take a Pandas dataframe or series as input and ideally return one (or a scalar value) as output. Extra parameters are fine, but they should be secondary, not the first argument. This is the same in both Pandas and Dask dataframe.
    def func(df, x=None):
        # do whatever you want here
        # the input to this function will have all the same first name
        return pd.DataFrame({'x': [x] * len(df),
                             'count': len(df),
                             'first_name': df.first_name})
You can then call df.groupby as normal:

    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame({'first_name': ['Alice', 'Alice', 'Bob'],
                       'last_name': ['Adams', 'Jones', 'Smith']})
    ddf = dd.from_pandas(df, npartitions=2)

    ddf.groupby('first_name').apply(func, x=3).compute()
This will produce the same output in either pandas or dask.dataframe:

       count first_name  x
    0      2      Alice  3
    1      2      Alice  3
    2      1        Bob  3
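The answer above also allows the function to return a scalar instead of a dataframe. A minimal pandas-only sketch of that case follows; `count_rows` and its `scale` parameter are hypothetical names chosen for illustration, and the explicit column selection is an assumption made so that the function sees the same columns regardless of how a given pandas version treats the grouping column inside apply.

```python
import pandas as pd


def count_rows(group, scale=1):
    # The group (a pandas dataframe) is the first positional argument;
    # `scale` is a secondary, optional parameter.
    return len(group) * scale


df = pd.DataFrame({'first_name': ['Alice', 'Alice', 'Bob'],
                   'last_name': ['Adams', 'Jones', 'Smith']})

# Select the non-grouping columns explicitly, then pass the extra
# parameter to apply as a keyword argument.
counts = df.groupby('first_name')[['last_name']].apply(count_rows, scale=10)
```

With a scalar return value, the result is a Series indexed by first_name, with one entry per group.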
With a little bit of guesswork, I think that the following is what you are after.
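    from fuzzywuzzy import fuzz  # assumed source of fuzz.partial_ratio

    def mapper(d):
        # d is the pandas dataframe for one first_name group
        def contraster(x, DF=d):
            # flag the rows whose last_name partially matches x
            matches = DF.apply(lambda row: fuzz.partial_ratio(row['last_name'], x) >= 50, axis=1)
            return [d.ID.iloc[i] for i, is_match in enumerate(matches) if is_match]
        d['out'] = d.apply(lambda row: contraster(row['last_name']), axis=1)
        return d

    df.groupby('first_name').apply(mapper).compute()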
Applied to your data, you get:
       ID first_name  last_name   out
    2   X      Danae      Smith   [X]
    4  12      Jacke       Toro  [12]
    0   X       Jake   Del Toro   [X]
    1   U       John     Foster   [U]
    5  13        Jon    Froster  [13]
    3   Y    Beatriz  Patterson   [Y]
i.e., because you group by first_name, each group only contains one item, which matches only with itself.
If, however, you had some first_name values that appeared in multiple rows, you would get matches:
    entities = pd.DataFrame(
        {'first_name': ['Jake', 'Jake', 'Jake', 'John'],
         'last_name': ['Del Toro', 'Toro', 'Smith', 'Froster'],
         'ID': ['Z', 'U', 'X', 'Y']})
Output:
      ID first_name last_name     out
    0  Z       Jake  Del Toro  [Z, U]
    1  U       Jake      Toro  [Z, U]
    2  X       Jake     Smith     [X]
    3  Y       John   Froster     [Y]
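The within-group matching can also be tried in plain pandas before moving to Dask. The sketch below is not the code from the answer: it swaps fuzz.partial_ratio for the standard library's difflib.SequenceMatcher so that it runs without extra dependencies, and the names `similar`, `match_within_group` and the 0.5 threshold are hypothetical. SequenceMatcher scores whole strings rather than partial matches, so its results may differ from fuzzywuzzy on other data.

```python
from difflib import SequenceMatcher

import pandas as pd


def similar(a, b, threshold=0.5):
    # Stand-in for `fuzz.partial_ratio(a, b) >= 50` (an assumption, not
    # the same scoring): ratio() compares the two whole strings.
    return SequenceMatcher(None, a, b).ratio() >= threshold


def match_within_group(group):
    # `group` holds the rows that share one first_name.
    names = group['last_name'].tolist()
    ids = group['ID'].tolist()
    # For each row, collect the IDs of all rows in the group whose
    # last_name is similar to this row's last_name.
    out = [[i for i, other in zip(ids, names) if similar(name, other)]
           for name in names]
    return pd.DataFrame({'last_name': names, 'ID': ids, 'out': out},
                        index=group.index)


entities = pd.DataFrame({
    'first_name': ['Jake', 'Jake', 'Jake', 'John'],
    'last_name': ['Del Toro', 'Toro', 'Smith', 'Froster'],
    'ID': ['Z', 'U', 'X', 'Y'],
})

# group_keys=False keeps the original row index so the result can be
# sorted back into input order.
result = (entities.groupby('first_name', group_keys=False)[['last_name', 'ID']]
          .apply(match_within_group)
          .sort_index())
```

On this small example the `out` column agrees with the output shown above: the two Jake rows with last names Del Toro and Toro match each other, and every other row matches only itself.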
If you do not require exact matches on the first_name, then you may need to sort/set the index by first_name and use map_partitions in a similar way. In that case, you will need to rephrase your question.