Finding count of duplicate values and ordering in a Pandas dataframe
Say you do:
agg = df.groupby('movie title')['age'].agg(ave_age='mean', size='size')
You'll get a DataFrame with columns ave_age and size.
agg[agg['size'] > 100]
will give you only the movies that have more than 100 users. From there, sort by agg.ave_age and take the first 5 rows, which are the movies with the lowest average age. It should look something like this:
agg[agg['size'] > 100].sort_values(by='ave_age', ascending=True).head(5)
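For reference, here is a minimal end-to-end sketch of the aggregate-then-filter steps above. The sample DataFrame and the min_users threshold are made up for illustration (a stand-in for the real data and for 100), and the keyword form of agg assumes pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    'movie title': ['Title 1', 'Title 2', 'Title 2',
                    'Title 3', 'Title 3', 'Title 3'],
    'age': [30, 10, 14, 12, 13, 14],
})

min_users = 1  # illustrative stand-in for the 100 used above

# One row per movie: mean age of its users and number of rows (users).
agg = df.groupby('movie title')['age'].agg(ave_age='mean', size='size')

# Keep movies with more than min_users rows, then take the lowest mean ages.
youngest = agg[agg['size'] > min_users].sort_values(by='ave_age').head(5)
print(youngest)
```

Aggregating first keeps the result at one row per movie, so the size filter and the sort both run on the small summary table rather than on the full data.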
The filter is a boolean flag for each row: True if that row's movie title appears at least n times in the data, and False otherwise.
n = 100
filter = (df.groupby(['movie title'])['age']
          .transform(lambda group: group.count()) >= n)
Given the small size of your sample data, I will set n to be 2 and create my filter.
Now I just filter on movies with a count of at least n, calculate the average age per group, and then take the five smallest (i.e. the lowest average ages).
>>> df[filter.values].groupby('movie title').age.mean().nsmallest(5)
movie title
Title 2    12
Title 3    13
Name: age, dtype: int64
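The transform-based approach can be sketched end to end as follows. The sample data here is hypothetical (it is not the asker's data), and the names group_sizes and mask are illustrative; mask is used in place of filter only to avoid shadowing the Python builtin.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    'movie title': ['Title 1', 'Title 2', 'Title 2',
                    'Title 3', 'Title 3', 'Title 3'],
    'age': [30, 10, 14, 12, 13, 14],
})

n = 2  # minimum number of rows a movie needs to be kept

# transform returns one value per original row: the size of that row's group.
group_sizes = df.groupby('movie title')['age'].transform('count')

# Row-level boolean mask, aligned with df's index.
mask = group_sizes >= n

# Mean age per remaining movie, then the five lowest means.
result = df[mask].groupby('movie title')['age'].mean().nsmallest(5)
print(result)
```

Because transform keeps the original index, the mask can be applied to df directly, without going through .values. Note that 'count' skips missing ages, so the two approaches can differ from a size-based count when the age column contains NaN.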