What is the pandas equivalent of dplyr summarize/aggregate by multiple functions?
The equivalent of
df %>% groupby(col1) %>% summarize(col2_agg=max(col2), col3_agg=min(col3))
is
df.groupby('col1').agg({'col2': 'max', 'col3': 'min'})
which returns
col2 col3col1 1 5 -52 9 -9
The returning object is a pandas.DataFrame with an index called col1
and columns named col2
and col3
. By default, when you group your data pandas sets the grouping column(s) as index for efficient access and modification. However, if you don't want that, there are two alternatives to set col1
as a column.
Pass
as_index=False
:df.groupby('col1', as_index=False).agg({'col2': 'max', 'col3': 'min'})
Call
reset_index
:df.groupby('col1').agg({'col2': 'max', 'col3': 'min'}).reset_index()
both yield
col1 col2 col3 1 5 -5 2 9 -9
You can also pass multiple functions to groupby.agg
.
agg_df = df.groupby('col1').agg({'col2': ['max', 'min', 'std'], 'col3': ['size', 'std', 'mean', 'max']})
This also returns a DataFrame but now it has a MultiIndex for columns.
col2 col3 max min std size std mean maxcol1 1 5 1 1.581139 5 1.581139 -3 -12 9 0 3.535534 5 3.535534 -6 0
MultiIndex is very handy for selection and grouping. Here are some examples:
agg_df['col2'] # select the second column max min stdcol1 1 5 1 1.5811392 9 0 3.535534agg_df[('col2', 'max')] # select the maximum of the second columnOut: col11 52 9Name: (col2, max), dtype: int64agg_df.xs('max', axis=1, level=1) # select the maximum of all columnsOut: col2 col3col1 1 5 -12 9 0
Earlier (before version 0.20.0) it was possible to use dictionaries for renaming the columns in the agg
call. For example
df.groupby('col1')['col2'].agg({'max_col2': 'max'})
would return the maximum of the second column as max_col2
:
max_col2col1 1 52 9
However, it was deprecated in favor of the rename method:
df.groupby('col1')['col2'].agg(['max']).rename(columns={'max': 'col2_max'}) col2_maxcol1 1 52 9
It can get verbose for a DataFrame like agg_df
defined above. You can use a renaming function to flatten those levels in that case:
agg_df.columns = ['_'.join(col) for col in agg_df.columns] col2_max col2_min col2_std col3_size col3_std col3_mean col3_maxcol1 1 5 1 1.581139 5 1.581139 -3 -12 9 0 3.535534 5 3.535534 -6 0
For operations like groupby().summarize(newcolumn=max(col2 * col3))
, you can still use agg by first adding a new column with assign
.
df.assign(new_col=df.eval('col2 * col3')).groupby('col1').agg('max') col2 col3 new_colcol1 1 5 -1 -12 9 0 0
This returns maximum for old and new columns but as always you can slice that.
df.assign(new_col=df.eval('col2 * col3')).groupby('col1')['new_col'].agg('max')col11 -12 0Name: new_col, dtype: int64
With groupby.apply
this would be shorter:
df.groupby('col1').apply(lambda x: (x.col2 * x.col3).max())col11 -12 0dtype: int64
However, groupby.apply
treats this as a custom function so it is not vectorized. Up to now, the functions we passed to agg
('min', 'max', 'min', 'size' etc.) are vectorized and these are aliases for those optimized functions. You can replace df.groupby('col1').agg('min')
with df.groupby('col1').agg(min)
, df.groupby('col1').agg(np.min)
or df.groupby('col1').min()
and they will all execute the same function. You will not see the same efficiency when you use custom functions.
Lastly, as of version 0.20, agg
can be used on DataFrames directly, without having to group first. See examples here.
Check the side-by-side comparison given by Pandas documentation here: http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html#grouping-and-summarizing
R's dplyr
gdf <- group_by(df, col1)summarise(gdf, avg=mean(col1, na.rm=TRUE))
Pandas
gdf = df.groupby('col1')df.groupby('col1').agg({'col1': 'mean'})
It is very easy to transition your R code into python code without learning APIs of pandas using datar
:
>>> from datar import f>>> from datar.tibble import tibble>>> from datar.dplyr import group_by, summarize>>> from datar.base import min, max>>> data = tibble(... col1=[1,1,1,1,1,2,2,2,2,2],... col2=[1,2,3,4,5,6,7,8,9,0],... col3=[-1,-2,-3,-4,-5,-6,-7,-8,-9,0]... )>>> data >> group_by(f.col1) >> summarize(col2_agg=max(f.col2), col3_agg=min(f.col3)) col1 col2_agg col3_agg0 1 5 -51 2 9 -9
I am the author of the package. Feel free to submit issues if you have any questions.