pandas Series.value_counts returns inconsistent order for equal count strings
You have a few options to sort consistently given a series:
    import pandas as pd

    s = pd.Series(['a', 'b', 'a', 'c', 'c'])
    c = s.value_counts()
sort by index
Use pd.Series.sort_index:
    res = c.sort_index()
    a    2
    b    1
    c    2
    dtype: int64
sort by count (arbitrary for ties)
For descending counts, do nothing, as this is the default. Otherwise, you can use pd.Series.sort_values, which defaults to ascending=True. In either case, you should make no assumptions about how ties are handled.
    res = c.sort_values()
    b    1
    c    2
    a    2
    dtype: int64
More efficiently, since c is already sorted by descending count, you can reverse it with c.iloc[::-1] instead of sorting again.
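As a small illustration of the two routes to ascending counts, the sketch below uses the same sample series as above. It only checks the counts, because the relative order of tied labels is not something to rely on.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'c'])
c = s.value_counts()          # descending counts by default

asc_sorted = c.sort_values()  # ascending counts via a fresh sort
asc_reversed = c.iloc[::-1]   # ascending counts by reversing the existing order

# Both list the counts in ascending order; tied labels may appear in a different order.
print(asc_sorted.tolist())
print(asc_reversed.tolist())
```

Reversing is a slice rather than a sort, which is why it is the cheaper option when c is already in descending order.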
sort by count and then by index
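You can use numpy.lexsort to sort by count and then by index. Note the reverse order of the keys: numpy.lexsort treats the last key as the primary one, so -c.values (descending counts) is used first for sorting and c.index breaks ties.
    import numpy as np

    res = c.iloc[np.lexsort((c.index, -c.values))]
    a    2
    c    2
    b    1
    dtype: int64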
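If you would rather stay within pandas, one possible alternative (a sketch, not the method used above) is to move the labels into a column and sort on both columns explicitly. The column names label and count are illustrative choices, not anything pandas requires.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'c'])
c = s.value_counts()

# Turn the counts into a two-column frame; 'label' and 'count' are illustrative names.
frame = c.rename_axis('label').reset_index(name='count')

# Sort by count (descending), then by label (ascending) to break ties deterministically.
frame = frame.sort_values(['count', 'label'], ascending=[False, True])

# Convert back to a Series indexed by label.
res = frame.set_index('label')['count']
print(res)
```

This is more verbose than the lexsort one-liner, but the sort keys and their directions are spelled out by name.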
You could use sort_index:
    print(df.value_counts().sort_index())
Output:
    a    1
    b    1
    dtype: int64
Please see the documentation if you want to use parameters (like ascending=True etc.).
sort_index and reindex(df.unique()) (as suggested by @Wen) seem to perform quite similarly:
    df.value_counts().sort_index(): 1000 loops, best of 3: 636 µs per loop
    df.value_counts().reindex(df.unique()): 1000 loops, best of 3: 880 µs per loop
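The two approaches are both deterministic but do not produce the same order: sort_index orders by label, while reindex(df.unique()) follows the order of first appearance. The sketch below shows the difference and a rough way to time both. The sample data is hypothetical, since the original df is not shown, and the timings will vary by machine and pandas version.

```python
import timeit
import pandas as pd

# Hypothetical sample data; the answer above does not show how df was built.
df = pd.Series(['b', 'a', 'b', 'c', 'a', 'b'])

by_label = df.value_counts().sort_index()
by_first_seen = df.value_counts().reindex(df.unique())

print(by_label.index.tolist())       # labels in sorted order
print(by_first_seen.index.tolist())  # labels in order of first appearance

# Rough timing comparison; absolute numbers depend on the environment.
candidates = [
    ('sort_index', lambda: df.value_counts().sort_index()),
    ('reindex', lambda: df.value_counts().reindex(df.unique())),
]
for name, stmt in candidates:
    seconds = timeit.timeit(stmt, number=200)
    print(f'{name}: {seconds / 200 * 1e6:.0f} microseconds per call')
```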