GroupBy pandas DataFrame and select most common value
You can use value_counts() to get a count series, and get the first row:
import pandas as pd

source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY'],
})

source.groupby(['Country', 'City']).agg(lambda x: x.value_counts().index[0])
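To see why index[0] picks the most common value, here is a minimal sketch on the values of a single group. The variable names are illustrative only. value_counts() sorts counts in descending order, so the first index label is a most frequent value; when counts tie, it is safer not to rely on which label comes first.

```python
import pandas as pd

# 'Short name' values of the ('USA', 'New-York') group from the example above.
short_names = pd.Series(['NY', 'New', 'NY'], name='Short name')

counts = short_names.value_counts()  # counts, sorted from most to least frequent
most_common = counts.index[0]        # label carrying the highest count

print(counts.to_dict())
print(most_common)  # NY
```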
If you also want to apply other aggregation functions in the same .agg() call, try this:
# Let's add a new col, account
source['account'] = [1, 2, 3, 3]

source.groupby(['Country', 'City']).agg(
    mod=('Short name', lambda x: x.value_counts().index[0]),
    avg=('account', 'mean'),
)
Pandas >= 0.16
pd.Series.mode is available!
Use groupby, GroupBy.agg, and apply the pd.Series.mode function to each group:
source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object
If this is needed as a DataFrame, use:
source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode).to_frame()

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY
The useful thing about Series.mode is that it always returns a Series, making it very compatible with agg and apply, especially when reconstructing the groupby output. It is also faster.
# Accepted answer.
%timeit source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
# Proposed in this post.
%timeit source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

5.56 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.76 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
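The point that Series.mode always returns a Series can be checked directly on plain Series, outside of any groupby. This is only a small sketch; the names single and multi are made up for illustration.

```python
import pandas as pd

# One clear winner: the result is still a Series, of length 1.
single = pd.Series(['NY', 'New', 'NY']).mode()

# A tie: both values come back, again as a Series.
multi = pd.Series(['NY', 'New', 'NY', 'New']).mode()

print(type(single).__name__, single.tolist())  # Series ['NY']
print(type(multi).__name__, len(multi))        # Series 2
```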
Dealing with Multiple Modes
Series.mode also does a good job when there are multiple modes:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
source2 = pd.concat(
    [source,
     pd.DataFrame([{'Country': 'USA', 'City': 'New-York', 'Short name': 'New'}])],
    ignore_index=True)

# Now `source2` has two modes for the
# ("USA", "New-York") group, they are "NY" and "New".
source2

  Country              City Short name
0     USA          New-York         NY
1     USA          New-York        New
2  Russia  Sankt-Petersburg        Spb
3     USA          New-York         NY
4     USA          New-York        New
source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City            
Russia   Sankt-Petersburg          Spb
USA      New-York            [NY, New]
Name: Short name, dtype: object
Or, if you want a separate row for each mode, you can use GroupBy.apply:
source2.groupby(['Country','City'])['Short name'].apply(pd.Series.mode)

Country  City               
Russia   Sankt-Petersburg  0    Spb
USA      New-York          0     NY
                           1    New
Name: Short name, dtype: object
If you don't care which mode is returned, as long as it is one of them, you will need a lambda that calls mode and extracts the first result.
source2.groupby(['Country','City'])['Short name'].agg(
    lambda x: pd.Series.mode(x)[0])

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object
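One caveat with extracting the first result: Series.mode ignores NaN by default, so a group containing only missing values yields an empty Series and indexing it with [0] fails. The following is a hedged sketch of a guard; first_mode is a hypothetical helper name, not a pandas function, and the small frame is made up for illustration.

```python
import numpy as np
import pandas as pd

def first_mode(values: pd.Series):
    """Return one mode of `values`, or NaN when there is none (hypothetical helper)."""
    modes = values.mode()  # NaN is ignored by default (dropna=True)
    return modes.iloc[0] if not modes.empty else np.nan

# Illustrative data: the Russian group has no usable 'Short name' at all.
df = pd.DataFrame({
    'Country': ['USA', 'USA', 'USA', 'USA', 'Russia'],
    'City': ['New-York'] * 4 + ['Sankt-Petersburg'],
    'Short name': ['NY', 'New', 'NY', 'New', np.nan],
})

result = df.groupby(['Country', 'City'])['Short name'].agg(first_mode)
print(result)
```

The all-missing group ends up as NaN in the result instead of raising, and the tied group gets one of its two modes.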
Alternatives to (not) consider
You can also use statistics.mode from Python's standard library, but...
import statistics

source.groupby(['Country','City'])['Short name'].apply(statistics.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object
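...it does not work well with multiple modes on Python versions before 3.8, where a StatisticsError is raised (from 3.8 onward, statistics.mode returns the first mode encountered instead). This is mentioned in the docs for those older versions: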
If data is empty, or if there is not exactly one most common value, StatisticsError is raised.
But you can see for yourself...
statistics.mode([1, 2])
# ---------------------------------------------------------------------------
# StatisticsError                           Traceback (most recent call last)
# ...
# StatisticsError: no unique mode; found 2 equally common values
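If the standard library is preferred anyway, Python 3.8 added statistics.multimode, which returns a list of all modes rather than raising on ties. A minimal sketch, assuming Python 3.8 or newer; the variable names are illustrative.

```python
import statistics

# A tie: both values are reported, in the order first encountered.
tied = statistics.multimode([1, 2])

# A single most common value still comes back as a one-element list.
unique = statistics.multimode(['NY', 'New', 'NY'])

print(tied)    # [1, 2]
print(unique)  # ['NY']
```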
For agg, the lambda function gets a Series, which does not have a 'Short name' attribute.
stats.mode returns a tuple of two arrays, so you have to take the first element of the first array in this tuple.
With these two simple changes:
source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0][0])
returns
                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY
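The remark that the function passed to agg receives a Series can be verified by recording the type of its argument. This is an illustrative sketch only: most_common and received_types are made-up names, and it uses value_counts rather than stats.mode so that it does not depend on a particular SciPy version.

```python
import pandas as pd

source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY'],
})

received_types = set()

def most_common(x):
    # Bookkeeping for illustration: remember what pandas passes in.
    received_types.add(type(x).__name__)
    return x.value_counts().index[0]

result = source.groupby(['Country', 'City']).agg(most_common)

print(received_types)  # {'Series'}
print(result)
```

Each call sees one column of one group as a Series, which is why column lookups such as x['Short name'] fail inside the function.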