Apply multiple functions to multiple groupby columns

python group-by aggregate-functions pandas

The second half of the currently accepted answer is outdated and relies on two deprecated features. First and most important, you can no longer pass a dictionary of dictionaries to the groupby agg method. Second, never use .ix; use .loc or .iloc instead.
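As a minimal sketch of the second point, a label-based lookup that once went through .ix can be written with .loc, and a position-based one with .iloc. The tiny frame and its labels below are made up for illustration only.

```python
import pandas as pd

# Hypothetical frame with string labels, used only to contrast the two indexers.
frame = pd.DataFrame({'a': [10, 20, 30]}, index=['x', 'y', 'z'])

by_label = frame.loc['y', 'a']     # label-based selection (row 'y', column 'a')
by_position = frame.iloc[1, 0]     # position-based selection (second row, first column)

print(by_label, by_position)
```

Both lookups address the same cell here; the difference only matters when labels and positions disagree, which is the ambiguity .ix used to hide.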

If you need to work with two separate columns at the same time, I suggest using the apply method, which implicitly passes a DataFrame to the applied function. Let's use a DataFrame similar to the one above:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
    df['group'] = [0, 0, 1, 1]
    df

              a         b         c         d  group
    0  0.418500  0.030955  0.874869  0.145641      0
    1  0.446069  0.901153  0.095052  0.487040      0
    2  0.843026  0.936169  0.926090  0.041722      1
    3  0.635846  0.439175  0.828787  0.714123      1

A dictionary that maps column names to aggregation functions is still a perfectly good way to perform an aggregation.

    df.groupby('group').agg({'a':['sum', 'max'],
                             'b':'mean',
                             'c':'sum',
                             'd': lambda x: x.max() - x.min()})

                  a                   b         c         d
                sum       max      mean       sum  <lambda>
    group
    0      0.864569  0.446069  0.466054  0.969921  0.341399
    1      1.478872  0.843026  0.687672  1.754877  0.672401

If you don't like that ugly lambda column name, you can use a normal function and supply a custom name to the special __name__ attribute like this:

    def max_min(x):
        return x.max() - x.min()

    max_min.__name__ = 'Max minus Min'

    df.groupby('group').agg({'a':['sum', 'max'],
                             'b':'mean',
                             'c':'sum',
                             'd': max_min})

                  a                   b         c             d
                sum       max      mean       sum Max minus Min
    group
    0      0.864569  0.446069  0.466054  0.969921      0.341399
    1      1.478872  0.843026  0.687672  1.754877      0.672401

Using `apply` and returning a Series

If multiple columns need to interact, you cannot use agg, which implicitly passes one column at a time to the aggregating function as a Series. With apply, the entire group is passed to the function as a DataFrame.
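To make that difference concrete, here is a small sketch that records the type each method hands to the function. The data, the helper names (spread, prodsum) and the two tracking sets are assumptions introduced for illustration; selecting the columns before apply is just one way to keep the grouping column out of the function's input.

```python
import pandas as pd

# Hypothetical data; the column names mirror the frame used in this answer.
df = pd.DataFrame({'c': [1.0, 2.0, 3.0, 4.0],
                   'd': [10.0, 20.0, 30.0, 40.0],
                   'group': [0, 0, 1, 1]})

seen_by_agg = set()
seen_by_apply = set()

def spread(x):
    # Called by agg: x is a single column of one group.
    seen_by_agg.add(type(x).__name__)
    return x.max() - x.min()

def prodsum(g):
    # Called by apply: g holds every selected column of one group.
    seen_by_apply.add(type(g).__name__)
    return (g['c'] * g['d']).sum()

agg_result = df.groupby('group').agg({'d': spread})
apply_result = df.groupby('group')[['c', 'd']].apply(prodsum)

print(seen_by_agg, seen_by_apply)
```

Only prodsum can combine c and d, because only apply gives the function access to both columns at once.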

I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:

    def f(x):
        d = {}
        d['a_sum'] = x['a'].sum()
        d['a_max'] = x['a'].max()
        d['b_mean'] = x['b'].mean()
        d['c_d_prodsum'] = (x['c'] * x['d']).sum()
        return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])

    df.groupby('group').apply(f)

             a_sum     a_max    b_mean  c_d_prodsum
    group
    0      0.864569  0.446069  0.466054     0.173711
    1      1.478872  0.843026  0.687672     0.630494

If you are in love with MultiIndexes, you can still return a Series with one like this:

    def f_mi(x):
        d = []
        d.append(x['a'].sum())
        d.append(x['a'].max())
        d.append(x['b'].mean())
        d.append((x['c'] * x['d']).sum())
        return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
                                   ['sum', 'max', 'mean', 'prodsum']])

    df.groupby('group').apply(f_mi)

                  a                   b       c_d
                sum       max      mean   prodsum
    group
    0      0.864569  0.446069  0.466054  0.173711
    1      1.478872  0.843026  0.687672  0.630494

For the first part you can pass a dict of column names for keys and a list of functions for the values:

    In [28]: df
    Out[28]:
              A         B         C         D         E  GRP
    0  0.395670  0.219560  0.600644  0.613445  0.242893    0
    1  0.323911  0.464584  0.107215  0.204072  0.927325    0
    2  0.321358  0.076037  0.166946  0.439661  0.914612    1
    3  0.133466  0.447946  0.014815  0.130781  0.268290    1

    In [26]: f = {'A':['sum','mean'], 'B':['prod']}

    In [27]: df.groupby('GRP').agg(f)
    Out[27]:
                A                   B
              sum      mean      prod
    GRP
    0    0.719580  0.359790  0.102004
    1    0.454824  0.227412  0.034060

UPDATE 1:

Because the aggregate function works on Series, references to the other column names are lost. To get around this, you can reference the full dataframe and index it using the group indices within the lambda function.

Here's a hacky workaround:

    In [67]: f = {'A':['sum','mean'], 'B':['prod'], 'D': lambda g: df.loc[g.index].E.sum()}

    In [69]: df.groupby('GRP').agg(f)
    Out[69]:
                A                   B         D
              sum      mean      prod  <lambda>
    GRP
    0    0.719580  0.359790  0.102004  1.170219
    1    0.454824  0.227412  0.034060  1.182901

Here, the resultant 'D' column is made up of the summed 'E' values.

UPDATE 2:

Here's a method that I think will do everything you ask. First make a custom lambda function. Below, g references the group. When aggregating, g will be a Series. Passing g.index to df.loc[] selects the current group from df. I then test whether column C is less than 0.5. The returned boolean Series is passed to g[], which selects only the rows that meet the criterion.

    In [95]: cust = lambda g: g[df.loc[g.index]['C'] < 0.5].sum()

    In [96]: f = {'A':['sum','mean'], 'B':['prod'], 'D': {'my name': cust}}

    In [97]: df.groupby('GRP').agg(f)
    Out[97]:
                A                   B         D
              sum      mean      prod   my name
    GRP
    0    0.719580  0.359790  0.102004  0.204072
    1    0.454824  0.227412  0.034060  0.570441
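The nested dict {'my name': cust} in the snippet above relies on the dictionary-of-dictionaries renaming that pandas 1.0 and later reject. As a hedged sketch, the same four numbers can be computed with apply, where the function sees the whole group and no lookup into the outer df is needed. The function name summarize and the flat labels A_sum, A_mean and B_prod are assumptions made for this example; the data are the rounded values printed earlier in this answer.

```python
import pandas as pd

# Rounded values copied from the frame printed in this answer.
df = pd.DataFrame({
    'A': [0.395670, 0.323911, 0.321358, 0.133466],
    'B': [0.219560, 0.464584, 0.076037, 0.447946],
    'C': [0.600644, 0.107215, 0.166946, 0.014815],
    'D': [0.613445, 0.204072, 0.439661, 0.130781],
    'GRP': [0, 0, 1, 1],
})

def summarize(g):
    # g is the sub-DataFrame for one group, so C and D can be combined directly.
    return pd.Series({
        'A_sum': g['A'].sum(),
        'A_mean': g['A'].mean(),
        'B_prod': g['B'].prod(),
        # Sum D only over the rows of this group where C < 0.5.
        'my name': g.loc[g['C'] < 0.5, 'D'].sum(),
    })

# Selecting the value columns first keeps the grouping column out of g.
result = df.groupby('GRP')[['A', 'B', 'C', 'D']].apply(summarize)
print(result)
```

Because the inputs are rounded, the results can differ from the printed output in the last decimal place.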

`Pandas >= 0.25.0`, named aggregations

Since pandas 0.25.0, we are moving away from dictionary-based aggregation and renaming and toward named aggregation, where each keyword argument is an output column name and its value is a (column, aggfunc) tuple. This lets us aggregate and rename to a more informative column name in a single step:

Example:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
    df['group'] = [0, 0, 1, 1]

              a         b         c         d  group
    0  0.521279  0.914988  0.054057  0.125668      0
    1  0.426058  0.828890  0.784093  0.446211      0
    2  0.363136  0.843751  0.184967  0.467351      1
    3  0.241012  0.470053  0.358018  0.525032      1

Apply GroupBy.agg with named aggregation:

    df.groupby('group').agg(
                 a_sum=('a', 'sum'),
                 a_mean=('a', 'mean'),
                 b_mean=('b', 'mean'),
                 c_sum=('c', 'sum'),
                 d_range=('d', lambda x: x.max() - x.min())
    )

              a_sum    a_mean    b_mean     c_sum   d_range
    group
    0      0.947337  0.473668  0.871939  0.838150  0.320543
    1      0.604149  0.302074  0.656902  0.542985  0.057681
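If the bare tuples feel too implicit, pandas also provides pd.NamedAgg, a namedtuple with column and aggfunc fields that can be used in their place. The sketch below uses made-up integer data and a hypothetical helper, value_range, so that the results are easy to verify by hand.

```python
import pandas as pd

# Hypothetical data chosen so the results are easy to check by hand.
df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'd': [10, 4, 7, 2],
                   'group': [0, 0, 1, 1]})

def value_range(x):
    # Spread of one column within one group.
    return x.max() - x.min()

result = df.groupby('group').agg(
    a_sum=pd.NamedAgg(column='a', aggfunc='sum'),
    d_range=pd.NamedAgg(column='d', aggfunc=value_range),
)
print(result)
```

Each aggregation here still sees a single column, so calculations that combine columns still need apply.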
