When to use Category rather than Object?
Use a category when there is lots of repetition that you expect to exploit.
For example, suppose I want the aggregate size per exchange for a large table of trades. Using the default object dtype is totally reasonable:
In [6]: %timeit trades.groupby('exch')['size'].sum()
1000 loops, best of 3: 1.25 ms per loop
But since the list of possible exchanges is pretty small, and because there is lots of repetition, I could make this faster by using a category:
In [7]: trades['exch'] = trades['exch'].astype('category')
In [8]: %timeit trades.groupby('exch')['size'].sum()
1000 loops, best of 3: 702 µs per loop
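If you want to try this yourself, here is a minimal sketch on synthetic data. The exchange codes, row count, and random sizes are made up for illustration and are not the original `trades` table, so timings on your machine will differ from the ones above. It only checks that both dtypes give the same totals and compares memory use.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 trades spread over four made-up exchange codes.
rng = np.random.default_rng(0)
exchanges = ["NYSE", "NASDAQ", "ARCA", "BATS"]
trades = pd.DataFrame({
    "exch": rng.choice(exchanges, size=100_000),
    "size": rng.integers(1, 1_000, size=100_000),
})

# Aggregate with the default string column.
by_object = trades.groupby("exch")["size"].sum()

# Same aggregation after converting the column to a category.
trades_cat = trades.assign(exch=trades["exch"].astype("category"))
by_category = trades_cat.groupby("exch", observed=True)["size"].sum()

# The totals per exchange are identical either way.
same_totals = by_object.to_dict() == by_category.to_dict()

# deep=True counts the string payloads, not just the pointers.
object_bytes = trades["exch"].memory_usage(deep=True)
category_bytes = trades_cat["exch"].memory_usage(deep=True)
print(same_totals, object_bytes, category_bytes)
```

To compare speed, run `%timeit` on the two `groupby` lines in IPython. How big the gap is depends on the pandas version and on how many distinct values the column has.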
Note that categories are really a form of dynamic enumeration. They are most useful if the range of possible values is fixed and finite.
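To make the "dynamic enumeration" idea concrete, here is a small sketch with made-up exchange codes. Each distinct label is stored once and every row holds a small integer code pointing at it. New labels can be registered later with `add_categories`.

```python
import pandas as pd

# Made-up values; the categories are inferred from the data.
exch = pd.Series(["NYSE", "ARCA", "NYSE", "BATS", "ARCA"], dtype="category")

labels = list(exch.cat.categories)  # the distinct labels, stored once
codes = list(exch.cat.codes)        # one integer code per row
print(labels)  # ['ARCA', 'BATS', 'NYSE']
print(codes)   # [2, 0, 2, 1, 0]

# The "dynamic" part: register a new label, then use it.
exch = exch.cat.add_categories(["IEX"])
exch.iloc[0] = "IEX"
print(list(exch.cat.categories))  # ['ARCA', 'BATS', 'NYSE', 'IEX']
```

If the set of labels keeps growing with the data, you lose most of the benefit, which is why a fixed and small range of values is the sweet spot.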
The pandas documentation has a concise section on when to use the categorical data type:
The categorical data type is useful in the following cases:
- A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see here.
- The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order, see here.
- As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
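The second case (logical versus lexical order) is easy to see in a short sketch. The sample values below are mine, chosen to match the "one", "two", "three" example in the quote.

```python
import pandas as pd

words = pd.Series(["three", "one", "two", "one"])

# Plain strings sort alphabetically.
lexical = words.sort_values().tolist()

# An ordered categorical sorts by the order the categories are listed in.
logical_dtype = pd.CategoricalDtype(categories=["one", "two", "three"], ordered=True)
ranked = words.astype(logical_dtype)

logical = ranked.sort_values().tolist()
smallest, largest = ranked.min(), ranked.max()

print(lexical)            # ['one', 'one', 'three', 'two']
print(logical)            # ['one', 'one', 'two', 'three']
print(smallest, largest)  # one three
```

Without `ordered=True`, `min` and `max` are not defined for a categorical, so the flag matters if you rely on them.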