Pandas: convert categories to numbers


First, change the type of the column:

df.cc = pd.Categorical(df.cc)

Now the data look similar but are stored categorically. To capture the category codes:

df['code'] = df.cc.cat.codes

Now you have:

   cc  temp  code
0  US  37.0     2
1  CA  12.0     1
2  US  35.0     2
3  AU  20.0     0
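For reference, the steps above can be reproduced end to end. This is a minimal sketch that assumes the sample values shown in the table; the `code_to_label` name is only illustrative and not part of pandas.

```python
import pandas as pd

# Sample data matching the table above.
df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

# Convert to a categorical column; categories are sorted: AU, CA, US.
df["cc"] = pd.Categorical(df["cc"])
df["code"] = df["cc"].cat.codes

# Illustrative helper: map each integer code back to its label.
code_to_label = dict(enumerate(df["cc"].cat.categories))

print(df)
print(code_to_label)
```

The position of a label in `cat.categories` is its code, which is why `enumerate` is enough to recover the mapping.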

If you don't want to modify your DataFrame but simply get the codes:

df.cc.astype('category').cat.codes
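One caveat worth checking on your own data: missing values do not get a category of their own, and their code is -1. A small sketch, using a made-up series rather than the DataFrame from the question:

```python
import pandas as pd

# Hypothetical series with one missing country code.
s = pd.Series(["US", None, "CA"])

# Categories are CA (0) and US (1); the missing entry is coded as -1.
codes = s.astype("category").cat.codes
print(codes.tolist())
```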

Or use the categorical column as an index:

df2 = pd.DataFrame(df.temp)
df2.index = pd.CategoricalIndex(df.cc)
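With a categorical index the codes stay available on the index itself, and label-based lookups keep working. A sketch assuming the same sample data as above; `index_codes` and `us_temps` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

df2 = pd.DataFrame(df["temp"])
df2.index = pd.CategoricalIndex(df["cc"])

# Integer codes behind the categorical index (AU=0, CA=1, US=2).
index_codes = df2.index.codes

# Selecting by label returns every row for that category.
us_temps = df2.loc["US", "temp"]

print(list(index_codes))
print(us_temps.tolist())
```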


If you wish only to transform your series into integer identifiers, you can use pd.factorize.

Note that this solution, unlike pd.Categorical, does not sort alphabetically: codes are assigned in order of first appearance, so the first country gets 0. If you wish to start from 1, you can add a constant:

df['code'] = pd.factorize(df['cc'])[0] + 1
print(df)

   cc  temp  code
0  US  37.0     1
1  CA  12.0     2
2  US  35.0     1
3  AU  20.0     3

If you wish to sort alphabetically, specify sort=True:

df['code'] = pd.factorize(df['cc'], sort=True)[0] + 1 
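pd.factorize returns a pair, the codes and the unique values, which is why the snippets above take element `[0]`. The sketch below compares the two orderings on the sample country codes; the variable names are illustrative:

```python
import pandas as pd

cc = pd.Series(["US", "CA", "US", "AU"])

# Default: codes follow order of first appearance.
codes, uniques = pd.factorize(cc)

# sort=True: unique values are sorted first, then coded.
sorted_codes, sorted_uniques = pd.factorize(cc, sort=True)

print(list(codes), list(uniques))
print(list(sorted_codes), list(sorted_uniques))
```

Keeping the second element around gives you the lookup table for turning codes back into labels.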


If you are using the sklearn library you can use LabelEncoder. Like pd.Categorical, input strings are sorted alphabetically before encoding.

from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()
df['code'] = LE.fit_transform(df['cc'])
print(df)

   cc  temp  code
0  US  37.0     2
1  CA  12.0     1
2  US  35.0     2
3  AU  20.0     0
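If you need to go back from codes to labels, the fitted encoder keeps the mapping. A short sketch, assuming scikit-learn is installed and using the same sample country codes:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cc = pd.Series(["US", "CA", "US", "AU"])

le = LabelEncoder()
codes = le.fit_transform(cc)

# classes_ holds the sorted labels; a label's position is its code.
print(list(le.classes_))

# inverse_transform turns the integer codes back into labels.
labels = le.inverse_transform(codes)
print(list(labels))
```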