Apply function to grouped data frame in Dask: How do you specify the grouped Dataframe as argument in the function?


The function you pass to groupby-apply should take a pandas DataFrame or Series as its first argument and ideally return one (or a scalar value). Extra parameters are fine, but they must come after that first argument. This holds for both pandas and Dask dataframes.

    def func(df, x=None):
        # do whatever you want here
        # the input to this function will have all the same first name
        return pd.DataFrame({'x': [x] * len(df),
                             'count': len(df),
                             'first_name': df.first_name})

You can then call df.groupby as normal:

    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame({'first_name': ['Alice', 'Alice', 'Bob'],
                       'last_name': ['Adams', 'Jones', 'Smith']})

    ddf = dd.from_pandas(df, npartitions=2)

    ddf.groupby('first_name').apply(func, x=3).compute()

This will produce the same output in either pandas or dask.dataframe:

       count first_name  x
    0      2      Alice  3
    1      2      Alice  3
    2      1        Bob  3
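As a variation on the point that the function may return something smaller than the group, here is a minimal pandas-only sketch that returns one summary row per group. The name `summarize` is hypothetical and not part of the answer above. Selecting the non-key column before `apply` is a choice made for this sketch so that the group passed to the function has the same columns regardless of pandas version. The same call should work on a Dask dataframe, where groupby-apply also accepts a `meta` argument describing the output.

```python
import pandas as pd


def summarize(group, x=None):
    # `group` is the sub-DataFrame for one first_name; it is always the first argument.
    # `x` is an extra keyword argument forwarded unchanged by apply().
    return pd.Series({'count': len(group), 'x': x})


df = pd.DataFrame({'first_name': ['Alice', 'Alice', 'Bob'],
                   'last_name': ['Adams', 'Jones', 'Smith']})

# Select the non-key columns so the grouping column is not part of `group`.
result = df.groupby('first_name')[['last_name']].apply(summarize, x=3)
print(result)
```

Because every group returns a Series with the same labels, the result is a DataFrame indexed by `first_name` with one row per group.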


With a little bit of guesswork, I think that the following is what you are after.

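    from fuzzywuzzy import fuzz  # assumed source of fuzz.partial_ratio

    def mapper(d):
        # d is the pandas DataFrame for one first_name group

        def contraster(x, DF=d):
            # flag rows whose last_name partially matches x (score >= 50)
            matches = DF.apply(lambda row: fuzz.partial_ratio(row['last_name'], x) >= 50, axis=1)
            # collect the IDs of the matching rows
            return [d.ID.iloc[i] for i, matched in enumerate(matches) if matched]

        d['out'] = d.apply(lambda row:
            contraster(row['last_name']), axis=1)
        return d

    # df is assumed to be the Dask dataframe from the question
    df.groupby('first_name').apply(mapper).compute()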

Applied to your data, this gives:

       ID first_name  last_name   out
    2   X      Danae      Smith   [X]
    4  12      Jacke       Toro  [12]
    0   X       Jake   Del Toro   [X]
    1   U       John     Foster   [U]
    5  13        Jon    Froster  [13]
    3   Y    Beatriz  Patterson   [Y]

That is, because you group by first_name, each group contains only one row, which matches only itself.

If, however, you had some first_name values that appeared in multiple rows, you would get matches:

    entities = pd.DataFrame(
        {'first_name': ['Jake', 'Jake', 'Jake', 'John'],
         'last_name': ['Del Toro', 'Toro', 'Smith',
                       'Froster'],
         'ID': ['Z', 'U', 'X', 'Y']})

Output:

      ID first_name last_name     out
    0  Z       Jake  Del Toro  [Z, U]
    1  U       Jake      Toro  [Z, U]
    2  X       Jake     Smith     [X]
    3  Y       John   Froster     [Y]

If you do not require exact matches on first_name, then you may need to sort or set the index by first_name and use map_partitions in a similar way. In that case, you will need to reformulate your question.