
Closest equivalent of a factor variable in Python Pandas


This question is about a year old, but since it is still open, here is an update. pandas has introduced a categorical dtype that behaves very much like factors in R. See this link for more information:

http://pandas-docs.github.io/pandas-docs-travis/categorical.html

The snippet below, reproduced from that link, shows how to create a "factor" variable in pandas:

    In [1]: s = Series(["a","b","c","a"], dtype="category")

    In [2]: s
    Out[2]:
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): [a < b < c]
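As a rough sketch of how the categorical dtype lines up with R's factor vocabulary, the example below looks at the categories (levels) and the integer codes through the `.cat` accessor, then builds an ordered categorical. The "low"/"mid"/"high" labels and the variable names are made up for illustration.

```python
import pandas as pd

# Unordered categorical, similar to factor() in R
s = pd.Series(["a", "b", "c", "a"], dtype="category")

levels = list(s.cat.categories)  # distinct categories, like levels() in R
codes = list(s.cat.codes)        # zero-based integer code for each element

# Ordered categorical, similar to factor(..., ordered = TRUE) in R.
# The labels here are hypothetical.
sizes = pd.Series(
    pd.Categorical(
        ["low", "high", "mid", "low"],
        categories=["low", "mid", "high"],
        ordered=True,
    )
)
is_above_low = list(sizes > "low")  # comparisons follow the declared order

print(levels)
print(codes)
print(is_above_low)
```

Note that the codes start at 0, not at 1 as the integer codes of an R factor do.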


If you're looking to do modeling etc., the patsy library has lots of goodies for working with factors. I will admit to having struggled with this myself. I found these slides helpful. I wish I could give a better example, but this is as far as I've gotten.


If you're looking to map a categorical variable to a number as R does, pandas provides a function that does just that, pandas.factorize: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html

    import pandas as pd

    df = pd.read_csv('path_to_your_file')
    df['new_factor'], _ = pd.factorize(df['old_categorical'], sort=True)

This function returns a tuple: the integer codes for each row and the array of unique values those codes refer to. If you only want to assign the codes to a column, discard the second element, as above.
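To make the two return values concrete, here is a small sketch on a made-up column (the `colors` data is hypothetical, not from the question). It also shows what happens to a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
colors = pd.Series(["red", "blue", np.nan, "red", "green"])

# sort=True sorts the unique values before codes are assigned
codes, uniques = pd.factorize(colors, sort=True)

print(codes.tolist())  # one integer code per row; the missing value gets -1
print(list(uniques))   # sorted unique values; each code is a position in this list
```

The codes are zero-based, so add 1 if you need them to line up with R's 1-based integer codes.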

If you want a homegrown solution, you can use a combination of a set and a dictionary within a function. This method is a bit easier to apply over multiple columns, but note that None, NaN, etc. will be included as a category with this method:

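    def factor(var):
        # Map each distinct value to an integer code.
        # Set iteration order is arbitrary, so the codes are not sorted.
        var_set = set(var)
        mapping = {value: code for code, value in enumerate(var_set)}
        return [mapping[x] for x in var]

    # Single column: call the function on the whole column.
    # (Series.apply would call it once per element instead.)
    df['new_factor1'] = factor(df['old_categorical1'])
    # Several columns: DataFrame.apply passes each column to the function.
    df[['new_factor2', 'new_factor3']] = df[['old_categorical2', 'old_categorical3']].apply(factor)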
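For comparison, pd.factorize can also be applied across several columns. This is only a sketch on a made-up frame (the column names and values are hypothetical). Unlike the set-based helper above, it gives missing values the code -1 instead of a category of their own.

```python
import pandas as pd

# Hypothetical frame with two categorical columns
df = pd.DataFrame({
    "size": ["small", "large", "small", None],
    "color": ["red", "blue", "red", "blue"],
})

cols = ["size", "color"]
# factorize returns (codes, uniques); keep only the codes for each column
encoded = df[cols].apply(lambda col: pd.factorize(col, sort=True)[0])

print(encoded)
```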