Naming returned columns in Pandas aggregate function? [duplicate] Naming returned columns in Pandas aggregate function? [duplicate] python python

Naming returned columns in Pandas aggregate function? [duplicate]


For pandas >= 0.25

The functionality to name returned aggregate columns has been reintroduced in the master branch and is targeted for pandas 0.25. The new syntax is .agg(new_col_name=('col_name', 'agg_func'). Detailed example from the PR linked above:

In [2]: df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],   ...:                    'height': [9.1, 6.0, 9.5, 34.0],   ...:                    'weight': [7.9, 7.5, 9.9, 198.0]})   ...:In [3]: dfOut[3]:  kind  height  weight0  cat     9.1     7.91  dog     6.0     7.52  cat     9.5     9.93  dog    34.0   198.0In [4]: df.groupby('kind').agg(min_height=('height', 'min'),                                max_weight=('weight', 'max'))Out[4]:      min_height  max_weightkindcat          9.1         9.9dog          6.0       198.0

It will also be possible to use multiple lambda expressions with this syntax and the two-step rename syntax I suggested earlier (below) as per this PR. Again, copying from the example in the PR:

In [2]: df = pd.DataFrame({"A": ['a', 'a'], 'B': [1, 2], 'C': [3, 4]})In [3]: df.groupby("A").agg({'B': [lambda x: 0, lambda x: 1]})Out[3]:         B  <lambda> <lambda 1>Aa        0          1

and then .rename(), or in one go:

In [4]: df.groupby("A").agg(b=('B', lambda x: 0), c=('B', lambda x: 1))Out[4]:   b  cAa  0  0

For pandas < 0.25

The currently accepted answer by unutbu describes are great way of doing this in pandas versions <= 0.20. However, as of pandas 0.20, using this method raises a warning indicating that the syntax will not be available in future versions of pandas.

Series:

FutureWarning: using a dict on a Series for aggregation is deprecated and will be removed in a future version

DataFrames:

FutureWarning: using a dict with renaming is deprecated and will be removed in a future version

According to the pandas 0.20 changelog, the recommended way of renaming columns while aggregating is as follows.

# Create a sample data framedf = pd.DataFrame({'A': [1, 1, 1, 2, 2],                   'B': range(5),                   'C': range(5)})# ==== SINGLE COLUMN (SERIES) ====# Syntax soon to be deprecateddf.groupby('A').B.agg({'foo': 'count'})# Recommended replacement syntaxdf.groupby('A').B.agg(['count']).rename(columns={'count': 'foo'})# ==== MULTI COLUMN ====# Syntax soon to be deprecateddf.groupby('A').agg({'B': {'foo': 'sum'}, 'C': {'bar': 'min'}})# Recommended replacement syntaxdf.groupby('A').agg({'B': 'sum', 'C': 'min'}).rename(columns={'B': 'foo', 'C': 'bar'})# As the recommended syntax is more verbose, parentheses can# be used to introduce line breaks and increase readability(df.groupby('A')    .agg({'B': 'sum', 'C': 'min'})    .rename(columns={'B': 'foo', 'C': 'bar'}))

Please see the 0.20 changelog for additional details.

Update 2017-01-03 in response to @JunkMechanic's comment.

With the old style dictionary syntax, it was possible to pass multiple lambda functions to .agg, since these would be renamed with the key in the passed dictionary:

>>> df.groupby('A').agg({'B': {'min': lambda x: x.min(), 'max': lambda x: x.max()}})    B      max minA        1   2   02   4   3

Multiple functions can also be passed to a single column as a list:

>>> df.groupby('A').agg({'B': [np.min, np.max]})     B       amin amaxA          1    0    22    3    4

However, this does not work with lambda functions, since they are anonymous and all return <lambda>, which causes a name collision:

>>> df.groupby('A').agg({'B': [lambda x: x.min(), lambda x: x.max]})SpecificationError: Function names must be unique, found multiple named <lambda>

To avoid the SpecificationError, named functions can be defined a priori instead of using lambda. Suitable function names also avoid calling .rename on the data frame afterwards. These functions can be passed with the same list syntax as above:

>>> def my_min(x):>>>     return x.min()>>> def my_max(x):>>>     return x.max()>>> df.groupby('A').agg({'B': [my_min, my_max]})       B         my_min my_maxA              1      0      22      3      4


This will drop the outermost level from the hierarchical column index:

df = data.groupby(...).agg(...)df.columns = df.columns.droplevel(0)

If you'd like to keep the outermost level, you can use the ravel() function on the multi-level column to form new labels:

df.columns = ["_".join(x) for x in df.columns.ravel()]

For example:

import pandas as pdimport pandas.rpy.common as comimport numpy as npdata = com.load_data('Loblolly')print(data.head())#     height  age Seed# 1     4.51    3  301# 15   10.89    5  301# 29   28.72   10  301# 43   41.74   15  301# 57   52.70   20  301df = data.groupby('Seed').agg(    {'age':['sum'],     'height':['mean', 'std']})print(df.head())#       age     height           #       sum        std       mean# Seed                           # 301    78  22.638417  33.246667# 303    78  23.499706  34.106667# 305    78  23.927090  35.115000# 307    78  22.222266  31.328333# 309    78  23.132574  33.781667df.columns = df.columns.droplevel(0)print(df.head())

yields

      sum        std       meanSeed                           301    78  22.638417  33.246667303    78  23.499706  34.106667305    78  23.927090  35.115000307    78  22.222266  31.328333309    78  23.132574  33.781667

Alternatively, to keep the first level of the index:

df = data.groupby('Seed').agg(    {'age':['sum'],     'height':['mean', 'std']})df.columns = ["_".join(x) for x in df.columns.ravel()]

yields

      age_sum   height_std  height_meanSeed                           301        78    22.638417    33.246667303        78    23.499706    34.106667305        78    23.927090    35.115000307        78    22.222266    31.328333309        78    23.132574    33.781667


I agree with the OP that it seems more natural and consistent to name and define the output columns in the same place (e.g. as is done with tidyverse's summarize in R), but a work-around in pandas for now is to create the new columns with desired names via assign before doing the aggregation:

data.assign(    f=data['column1'],    mean=data['column2'],    std=data['column2']).groupby('Country').agg(dict(f=sum, mean=np.mean, std=np.std)).reset_index()

(Using reset_index turns 'Country', 'f', 'mean', and 'std' all into regular columns with a separate integer index.)