Finding max occurrence of a column's value, after group-by on another column
You can try a double groupby with size and idxmax. The output of idxmax is a Series of tuples (because of the MultiIndex), so use apply to pick the city out of each tuple:
df = df.groupby(['id','city']).size().groupby(level=0).idxmax().apply(lambda x: x[1]).reset_index(name='city')
Other solutions:
s = df.groupby(['id','city']).size()
df = s.loc[s.groupby(level=0).idxmax()].reset_index().drop(0,axis=1)
Or:
df = df.groupby(['id'])['city'].apply(lambda x: x.value_counts().index[0]).reset_index()
print (df)
                     id        city
0  000.tushar@gmail.com   Bangalore
1      00078r@gmail.com  Vijayawada
2    0007ayan@gmail.com  Jamshedpur
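To see why the double groupby works, it can help to look at each intermediate result. The following is a minimal sketch on made-up sample data (the ids, cities, and variable names are illustrative assumptions, not the asker's real frame):

```python
import pandas as pd

# Illustrative sample data (an assumption, not the asker's real frame)
df = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b", "c"],
    "city": ["X", "X", "Y", "Z", "Z", "W"],
})

# Step 1: count rows per (id, city) pair; the result has a MultiIndex
counts = df.groupby(["id", "city"]).size()

# Step 2: within each id, find the (id, city) label with the largest count
top_labels = counts.groupby(level=0).idxmax()

# Step 3: keep only the city part of each (id, city) tuple
top_city = top_labels.apply(lambda label: label[1]).reset_index(name="city")

print(top_city)
```

Note that idxmax returns only one label per group, so if two cities are tied for an id, only one of them is kept.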
The recommended approach is groupby('id').apply(your_custom_function), where your_custom_function counts the occurrences of each 'city' within the group and returns the most frequent one (or, as you mentioned, multiple max values). We don't even have to use .agg('city').
import pandas as pd

def get_top_city(g):
    # most frequent city within one id group
    return g['city'].value_counts().idxmax()

df = pd.DataFrame.from_records(
    [('000.tushar@gmail.com', 'Bangalore'), ('00078r@gmail.com', 'Mumbai'),
     ('0007ayan@gmail.com', 'Jamshedpur'), ('0007ayan@gmail.com', 'Jamshedpur'),
     ('000.tushar@gmail.com', 'Bangalore'), ('00078r@gmail.com', 'Mumbai'),
     ('00078r@gmail.com', 'Vijayawada'), ('00078r@gmail.com', 'Vijayawada'),
     ('00078r@gmail.com', 'Vijayawada')],
    columns=['id','city'],
    index=None
)

topdf = df.groupby('id').apply(get_top_city)

id
000.tushar@gmail.com     Bangalore
00078r@gmail.com        Vijayawada
0007ayan@gmail.com      Jamshedpur

# or list(topdf.items()) if you want a list of (id, city) tuples
# (iteritems() only exists in older pandas; it was removed in pandas 2.0)
[('000.tushar@gmail.com', 'Bangalore'), ('00078r@gmail.com', 'Vijayawada'), ('0007ayan@gmail.com', 'Jamshedpur')]
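For the "multiple max values" case, one possible variation is to return every city tied for the highest count instead of a single one. This is only a sketch: the helper name top_cities and the sample ids are hypothetical, and it selects the 'city' column before apply so the function receives a Series rather than the whole group.

```python
import pandas as pd

def top_cities(cities):
    """Return all cities tied for the highest count (hypothetical helper)."""
    counts = cities.value_counts()
    # keep every city whose count equals the maximum, sorted for stable output
    return sorted(counts[counts == counts.max()].index)

# Illustrative data: 'u1' has a tie, 'u2' has a clear winner
df = pd.DataFrame.from_records(
    [("u1", "Bangalore"), ("u1", "Mumbai"),
     ("u2", "Delhi"), ("u2", "Delhi"), ("u2", "Pune")],
    columns=["id", "city"],
)

# Series indexed by id, each value a list of the tied top cities
top = df.groupby("id")["city"].apply(top_cities)
print(top)
```

Each value in the result is a list, so ids with a single winner get a one-element list and ids with a tie get all tied cities.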