How do I filter a pandas DataFrame based on value counts?
Use groupby filter:
In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])

In [12]: df
Out[12]:
   A  B
0  1  2
1  1  4
2  5  6

In [13]: df.groupby("A").filter(lambda x: len(x) > 1)
Out[13]:
   A  B
0  1  2
1  1  4
I recommend reading the split-apply-combine section of the docs.
For better performance, use GroupBy.transform with 'size'. It returns the per-group count as a Series with the same length as the original df, so you can filter by boolean indexing:
df1 = df[df.groupby("A")['A'].transform('size') > 1]
Or use Series.map with Series.value_counts:
df1 = df[df['A'].map(df['A'].value_counts()) > 1]
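As a quick sanity check, the three approaches above can be run side by side on the same small frame. This is only a minimal sketch; the variable names by_filter, by_transform and by_map are illustrative and not part of the original answers.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=["A", "B"])

# 1) groupby + filter: keep whole groups that have more than one row
by_filter = df.groupby("A").filter(lambda x: len(x) > 1)

# 2) groupby + transform('size'): per-row group size, then boolean indexing
by_transform = df[df.groupby("A")["A"].transform("size") > 1]

# 3) map each value to its count from value_counts, then boolean indexing
by_map = df[df["A"].map(df["A"].value_counts()) > 1]

print(by_filter)
print(by_transform)
print(by_map)
```

All three should select the same rows here (the two rows where A is 1); they differ mainly in how much per-group Python work they do.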
@jezael's solution works very well. Here is a different approach to filtering based on value counts.
For example, if the dataset is:
df = pd.DataFrame({'a': [1,2,3,3,1,6], 'b': [11,2,33,4,55,6]})
Convert the counts to a dictionary and save it:
count_freq = dict(df['a'].value_counts())
Create a new column as a copy of the target column, then map the dictionary onto it:
df['count_freq'] = df['a']
df['count_freq'] = df['count_freq'].map(count_freq)
Now there is a new column holding the count of each value, so you can define a threshold and filter on it:
df[df.count_freq > 1]
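If you would rather not leave a helper column on the frame, the same idea can be wrapped in a small function that takes the threshold as a parameter. This is a sketch under stated assumptions: filter_by_count and min_count are hypothetical names chosen for illustration, and the function maps value_counts directly instead of going through a dictionary.

```python
import pandas as pd


def filter_by_count(frame, column, min_count):
    """Keep rows whose value in `column` occurs at least `min_count` times."""
    # Per-row occurrence count of the value in `column` (hypothetical helper)
    counts = frame[column].map(frame[column].value_counts())
    return frame[counts >= min_count]


df = pd.DataFrame({"a": [1, 2, 3, 3, 1, 6], "b": [11, 2, 33, 4, 55, 6]})

# min_count=2 is the same condition as count_freq > 1 above
result = filter_by_count(df, "a", 2)
print(result)
```

The original df is left unchanged, and the threshold can be varied without recomputing a stored column.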