Finding count of duplicate values and ordering in a Pandas dataframe

python pandas

Say you do

agg = df.groupby('movie title')['age'].agg(ave_age='mean', size='size')

You'll get a DataFrame with columns ave_age and size.

agg[agg['size'] > 100]

will give you only the movies that have more than 100 users. From there, sort by ave_age and take the first 5 rows (the lowest average ages, since the sort is ascending). It should look something like this:

agg[agg['size'] > 100].sort_values(by='ave_age', ascending=True).head(5)
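As a rough end-to-end sketch of the steps above: the sample data and the min_users threshold below are made up for illustration (a threshold of 1 stands in for 100), and the keyword form of agg assumes pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the answer.
df = pd.DataFrame({
    "movie title": ["A", "A", "A", "B", "B", "C"],
    "age": [20, 30, 40, 10, 12, 50],
})

min_users = 1  # illustrative stand-in for the 100-user threshold

# One row per title: the mean age and the number of ratings.
agg = df.groupby("movie title")["age"].agg(ave_age="mean", size="size")

# Keep titles with more than min_users rows, then take the 5 lowest mean ages.
top = (agg[agg["size"] > min_users]
       .sort_values(by="ave_age", ascending=True)
       .head(5))
print(top)
```

Here title C is dropped because it has only one row, and B sorts ahead of A because its mean age is lower.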


The filter creates a Boolean flag for each row that is True if that row's movie title appears at least n times (here n = 100) and False otherwise.

n = 100
filter = (df.groupby(['movie title'])['age']
          .transform(lambda group: group.count()) >= n)

Given the small size of your sample data, I will set n to be 2 and create my filter.

Now I just filter on movies with a count of at least n, calculate the average age per group, and then take the five smallest (i.e. lowest average age).

>>> df[filter.values].groupby('movie title').age.mean().nsmallest(5)
movie title
Title 2    12
Title 3    13
Name: age, dtype: int64
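A self-contained sketch of the same transform-based approach, for anyone who wants to run it. The data here is invented, not the asker's sample. The mask is named keep, an assumption made only to avoid shadowing Python's built-in filter, and transform("count") is used in place of the lambda.

```python
import pandas as pd

# Hypothetical data: Title 1 appears once, Title 2 twice, Title 3 three times.
df = pd.DataFrame({
    "movie title": ["Title 1", "Title 2", "Title 2",
                    "Title 3", "Title 3", "Title 3"],
    "age": [40, 10, 14, 12, 13, 14],
})

n = 2  # minimum number of rows a title needs in order to be kept

# transform("count") returns a Series aligned with df's rows, so the
# comparison yields one Boolean flag per row rather than per group.
keep = df.groupby("movie title")["age"].transform("count") >= n

# Mean age per surviving title, then the five lowest means.
result = df[keep].groupby("movie title")["age"].mean().nsmallest(5)
print(result)
```

Because the mask is aligned to the original rows, it can index df directly, and the second groupby only sees titles that passed the count check.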

CodeHunter