Get statistics for each group (such as count, mean, etc) using pandas GroupBy?

python pandas dataframe group-by pandas-groupby

Quick Answer:

The simplest way to get row counts per group is by calling .size(), which returns a Series:

df.groupby(['col1','col2']).size()

Usually you want this result as a DataFrame (instead of a Series) so you can do:

df.groupby(['col1', 'col2']).size().reset_index(name='counts')
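As a minimal, self-contained sketch of the two calls above (the three-row frame and its values are made up for illustration and are not the example data used later):

```python
import pandas as pd

# Hypothetical toy data, not the example frame used later in this answer.
df = pd.DataFrame({
    'col1': ['A', 'A', 'C'],
    'col2': ['B', 'B', 'D'],
    'col3': [1.0, 2.0, 3.0],
})

# Series indexed by (col1, col2) holding the number of rows per group.
sizes = df.groupby(['col1', 'col2']).size()

# Same information as a flat DataFrame with a named count column.
counts = df.groupby(['col1', 'col2']).size().reset_index(name='counts')
print(counts)
```

The Series form is convenient for lookups by group key; the DataFrame form is usually easier to merge or export.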

If you want to find out how to calculate the row counts and other statistics for each group, continue reading below.

Detailed example:

Consider the following example dataframe:

In [2]: df
Out[2]:
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

First let's use .size() to get the row counts:

In [3]: df.groupby(['col1', 'col2']).size()
Out[3]:
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64

Then let's use .size().reset_index(name='counts') to get the same row counts as a DataFrame:

In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]:
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1

Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

In [5]: (df
   ...: .groupby(['col1', 'col2'])
   ...: .agg({
   ...:     'col3': ['mean', 'count'],
   ...:     'col4': ['median', 'min', 'count']
   ...: }))
Out[5]:
            col4                  col3
          median   min count      mean count
col1 col2
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.
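One way to tame the nested labels is to join each (column, statistic) pair into a single name. This is only a sketch of a common idiom, not necessarily what the author of this answer does, and the toy frame below is made up for illustration:

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    'col1': ['A', 'A', 'C'],
    'col2': ['B', 'B', 'D'],
    'col3': [1.0, 3.0, 5.0],
    'col4': [2.0, 4.0, 6.0],
})

stats = df.groupby(['col1', 'col2']).agg({
    'col3': ['mean', 'count'],
    'col4': ['median', 'min', 'count'],
})

# stats.columns is a MultiIndex of (column, statistic) pairs;
# join each pair into a single label such as 'col3_mean'.
stats.columns = ['_'.join(pair) for pair in stats.columns]
flat = stats.reset_index()
print(flat)
```

This keeps the per-column counts, so it does not by itself address the second complaint about row counts.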

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:

In [6]: gb = df.groupby(['col1', 'col2'])
   ...: counts = gb.size().to_frame(name='counts')
   ...: (counts
   ...:  .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
   ...:  .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
   ...:  .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
   ...:  .reset_index()
   ...: )
   ...:
Out[6]:
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63
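On pandas 0.25 or newer, named aggregation can produce similarly flat column names in a single call. This is an alternative to the join approach, not the method used in this answer, and the small frame below is hypothetical:

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    'col1': ['A', 'A', 'C'],
    'col2': ['B', 'B', 'D'],
    'col3': [1.0, 3.0, 5.0],
    'col4': [2.0, 4.0, 6.0],
})

# Each keyword becomes an output column: name=(input column, function).
summary = (
    df.groupby(['col1', 'col2'])
      .agg(
          counts=('col3', 'size'),          # rows per group, NaN included
          col3_mean=('col3', 'mean'),
          col4_median=('col4', 'median'),
          col4_min=('col4', 'min'),
      )
      .reset_index()
)
print(summary)
```

Using 'size' rather than 'count' for the counts column mirrors gb.size(): it counts rows, not non-null values.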

Footnotes

The code used to generate the test data is shown below:

In [1]: import numpy as np
   ...: import pandas as pd
   ...:
   ...: keys = np.array([
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['E', 'F'],
   ...:         ['E', 'F'],
   ...:         ['G', 'H']
   ...:         ])
   ...:
   ...: df = pd.DataFrame(
   ...:     np.hstack([keys,np.random.randn(10,4).round(2)]),
   ...:     columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
   ...: )
   ...:
   ...: df[['col3', 'col4', 'col5', 'col6']] = \
   ...:     df[['col3', 'col4', 'col5', 'col6']].astype(float)
   ...:

Disclaimer:

If some of the columns you are aggregating contain null values, look at the row counts as a separate aggregation for each column. Otherwise you may be misled about how many records actually go into a statistic such as the mean, because pandas silently drops NaN entries when computing it.
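A small sketch of the difference, using a made-up frame with one missing value: .size() counts every row in the group, while .count() on a column counts only its non-null entries, which is what the mean is actually computed from.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 'A' has three rows, one with a missing col3.
df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'C'],
    'col3': [1.0, np.nan, 3.0, 5.0],
})

gb = df.groupby('col1')
rows = gb.size()               # counts every row, NaN or not
non_null = gb['col3'].count()  # counts only non-null col3 values
means = gb['col3'].mean()      # NaN entries are skipped

print(pd.DataFrame({'rows': rows, 'col3_count': non_null, 'col3_mean': means}))
```

For group 'A' the mean is based on two values even though the group has three rows.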


On a groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:

df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
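The result has one sub-column per statistic under each value column. The sketch below, on a made-up frame, shows two ways to pull pieces back out of it; the variable names are only illustrative:

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    'col1': ['A', 'A', 'C'],
    'col2': ['B', 'B', 'D'],
    'col3': [1.0, 3.0, 5.0],
    'col4': [2.0, 4.0, 6.0],
})

result = (df[['col1', 'col2', 'col3', 'col4']]
          .groupby(['col1', 'col2'])
          .agg(['mean', 'count']))

# A single statistic for a single column, addressed by its (column, stat) pair.
col3_mean = result[('col3', 'mean')]

# The same statistic for every column, via a cross-section on the inner level.
all_counts = result.xs('count', axis=1, level=1)
print(all_counts)
```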


Swiss Army Knife: `GroupBy.describe`

Returns count, mean, std, and other useful statistics per-group.

df.groupby(['A', 'B'])['C'].describe()

           count  mean   std   min   25%   50%   75%   max
A   B
bar one      1.0  0.40   NaN  0.40  0.40  0.40  0.40  0.40
    three    1.0  2.24   NaN  2.24  2.24  2.24  2.24  2.24
    two      1.0 -0.98   NaN -0.98 -0.98 -0.98 -0.98 -0.98
foo one      2.0  1.36  0.58  0.95  1.15  1.36  1.56  1.76
    three    1.0 -0.15   NaN -0.15 -0.15 -0.15 -0.15 -0.15
    two      2.0  1.42  0.63  0.98  1.20  1.42  1.65  1.87

To get specific statistics, just select them:

df.groupby(['A', 'B'])['C'].describe()[['count', 'mean']]

           count      mean
A   B
bar one      1.0  0.400157
    three    1.0  2.240893
    two      1.0 -0.977278
foo one      2.0  1.357070
    three    1.0 -0.151357
    two      2.0  1.423148

describe also works for multiple columns: change ['C'] to ['C', 'D'], or remove the selection altogether, and the result is a DataFrame with MultiIndex columns.
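A short sketch of the multi-column case, on a small made-up frame (the values come from a seeded random generator and are not the data shown above):

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'two'],
    'C': rng.standard_normal(4),
    'D': rng.standard_normal(4),
})

desc = df.groupby(['A', 'B'])[['C', 'D']].describe()

# Columns are (value column, statistic) pairs, so there are two levels.
print(desc.columns.nlevels)

# Selecting the outer level gives the familiar single-column layout,
# from which specific statistics can be picked as before.
c_stats = desc['C'][['count', 'mean']]
print(c_stats)
```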

You also get different statistics for string data. Here's an example:

df2 = df.assign(D=list('aaabbccc')).sample(n=100, replace=True)
with pd.option_context('display.precision', 2):
    display(df2.groupby(['A', 'B'])
               .describe(include='all')
               .dropna(how='all', axis=1))

              C                                                   D
          count  mean       std   min   25%   50%   75%   max count unique top freq
A   B
bar one    14.0  0.40  5.76e-17  0.40  0.40  0.40  0.40  0.40    14      1   a   14
    three  14.0  2.24  4.61e-16  2.24  2.24  2.24  2.24  2.24    14      1   b   14
    two     9.0 -0.98  0.00e+00 -0.98 -0.98 -0.98 -0.98 -0.98     9      1   c    9
foo one    22.0  1.43  4.10e-01  0.95  0.95  1.76  1.76  1.76    22      2   a   13
    three  15.0 -0.15  0.00e+00 -0.15 -0.15 -0.15 -0.15 -0.15    15      1   c   15
    two    26.0  1.49  4.48e-01  0.98  0.98  1.87  1.87  1.87    26      2   b   15

For more information, see the documentation.

pandas >= 1.1: `DataFrame.value_counts`

This is available from pandas 1.1. If you just want to capture the size of every group, it cuts out the GroupBy and is faster.

df.value_counts(subset=['col1', 'col2'])

Minimal Example

# Setup
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

df.value_counts(['A', 'B'])

A    B
foo  two      2
     one      2
     three    1
bar  two      1
     three    1
     one      1
dtype: int64
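To relate this back to the GroupBy version, here is a small sketch on made-up data. It assumes pandas 1.1 or newer; the main visible difference is ordering, since value_counts sorts by count (descending) while groupby(...).size() sorts by group key:

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
    'B': ['one', 'one', 'two', 'one', 'one'],
})

vc = df.value_counts(['A', 'B'])        # sorted by count, descending
sizes = df.groupby(['A', 'B']).size()   # sorted by group key

# After sorting by key, both hold the same counts.
print(vc.sort_index().tolist() == sizes.tolist())

# As a flat DataFrame, mirroring .size().reset_index(name='counts').
counts = vc.reset_index(name='counts')
print(counts)
```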

Other Statistical Analysis Tools

If you didn't find what you were looking for above, the User Guide has a comprehensive listing of supported statistical analysis, correlation, and regression tools.

CodeHunter
