GroupBy pandas DataFrame and select most common value


You can use value_counts() to get a count series, and get the first row:

import pandas as pd

source = pd.DataFrame({'Country': ['USA', 'USA', 'Russia', 'USA'],
                       'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                       'Short name': ['NY', 'New', 'Spb', 'NY']})

source.groupby(['Country', 'City']).agg(lambda x: x.value_counts().index[0])
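As a minimal sketch of why `index[0]` works: `value_counts()` sorts its result by count in descending order by default, so the first index label is the most frequent value. The sample Series below is made up for illustration only. When two values are tied, it is safer not to rely on which one comes first.

```python
import pandas as pd

# Hypothetical sample column, used only for illustration.
s = pd.Series(['NY', 'New', 'NY', 'Spb', 'NY'])

counts = s.value_counts()        # counts per value, sorted descending by default
most_common = counts.index[0]    # label with the highest count

print(counts)
print(most_common)
```

Note that `value_counts()` drops NaN by default, so missing values never win.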

If you also want to apply other aggregation functions in the same .agg() call, use named aggregation:

# Let's add a new column, account
source['account'] = [1, 2, 3, 3]

source.groupby(['Country', 'City']).agg(
    mod=('Short name', lambda x: x.value_counts().index[0]),
    avg=('account', 'mean'),
)


Pandas >= 0.16

pd.Series.mode is available!

Use groupby, GroupBy.agg, and apply the pd.Series.mode function to each group:

source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

If this is needed as a DataFrame, use

source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode).to_frame()

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY

The useful thing about Series.mode is that it always returns a Series, making it very compatible with agg and apply, especially when reconstructing the groupby output. It is also faster.

# Accepted answer.
%timeit source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
# Proposed in this post.
%timeit source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

5.56 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.76 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
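The point about `Series.mode` always returning a Series can be checked directly. The sketch below uses two made-up Series, one with a single mode and one with a tie; in both cases the result is a Series, just of different length.

```python
import pandas as pd

# Hypothetical inputs: one clear winner, and one two-way tie.
single = pd.Series(['NY', 'NY', 'New']).mode()
multi = pd.Series(['NY', 'New', 'NY', 'New']).mode()

# Both results are Series objects, never a bare scalar.
print(type(single).__name__, single.tolist())
print(type(multi).__name__, multi.tolist())
```

Because the return type does not depend on the data, the same function can be passed to agg or apply without special-casing single-mode groups.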

Dealing with Multiple Modes

Series.mode also does a good job when there are multiple modes:

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
source2 = pd.concat(
    [source, pd.DataFrame([{'Country': 'USA', 'City': 'New-York', 'Short name': 'New'}])],
    ignore_index=True)

# Now `source2` has two modes for the
# ("USA", "New-York") group, they are "NY" and "New".
source2

  Country              City Short name
0     USA          New-York         NY
1     USA          New-York        New
2  Russia  Sankt-Petersburg        Spb
3     USA          New-York         NY
4     USA          New-York        New

source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City            
Russia   Sankt-Petersburg          Spb
USA      New-York            [NY, New]
Name: Short name, dtype: object

Or, if you want a separate row for each mode, you can use GroupBy.apply:

source2.groupby(['Country','City'])['Short name'].apply(pd.Series.mode)

Country  City               
Russia   Sankt-Petersburg  0    Spb
USA      New-York          0     NY
                           1    New
Name: Short name, dtype: object

If you don't care which mode is returned as long as it's either one of them, then you will need a lambda that calls mode and extracts the first result.

source2.groupby(['Country','City'])['Short name'].agg(
    lambda x: pd.Series.mode(x)[0])

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object
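One caveat with taking the first mode: `Series.mode` drops NaN by default, so a group whose values are all missing yields an empty Series, and indexing it with `[0]` fails. A minimal sketch of a guarded version follows; the helper name `first_mode` and the sample frame `df` are hypothetical, introduced only for this illustration.

```python
import numpy as np
import pandas as pd


def first_mode(values: pd.Series):
    """Return one mode of `values`, or NaN when the group has no non-missing value."""
    modes = values.mode()  # NaN is dropped by default, so this can be empty
    return modes.iloc[0] if not modes.empty else np.nan


# Hypothetical data: the ("Russia", "Moscow") group has only a missing value.
df = pd.DataFrame({
    'Country': ['USA', 'USA', 'USA', 'Russia'],
    'City': ['New-York', 'New-York', 'New-York', 'Moscow'],
    'Short name': ['NY', 'New', 'NY', None],
})

result = df.groupby(['Country', 'City'])['Short name'].agg(first_mode)
print(result)
```

Returning NaN keeps the group in the output instead of raising, which is usually what you want when aggregating real data with gaps.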

Alternatives to (not) consider

You can also use statistics.mode from python, but...

import statistics

source.groupby(['Country','City'])['Short name'].apply(statistics.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

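...it does not work well when having to deal with multiple modes: on Python versions before 3.8, a StatisticsError is raised (from 3.8 onwards, the first mode encountered is returned instead). This is mentioned in the pre-3.8 docs: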

If data is empty, or if there is not exactly one most common value, StatisticsError is raised.

But you can see for yourself...

statistics.mode([1, 2])
# ---------------------------------------------------------------------------
# StatisticsError                           Traceback (most recent call last)
# ...
# StatisticsError: no unique mode; found 2 equally common values
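If you are on Python 3.8 or later (an assumption, since the post does not state a version), the standard library also offers `statistics.multimode`, which returns every most common value as a list instead of raising. A minimal sketch with made-up data:

```python
import statistics

# Hypothetical data with a two-way tie between 'NY' and 'New'.
data = ['NY', 'New', 'NY', 'New', 'Spb']

# multimode returns all modes, in the order they were first encountered.
modes = statistics.multimode(data)
print(modes)

# An empty input gives an empty list rather than an error.
print(statistics.multimode([]))
```

This still returns a plain Python list per group, so `Series.mode` remains the more natural fit inside a groupby.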


For agg, the lambda function gets a Series, which does not have a 'Short name' attribute.

stats.mode returns a tuple of two arrays, so you have to take the first element of the first array in this tuple.

With these two simple changes:

source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0][0])

returns

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY