Pandas: convert categories to numbers
First, change the type of the column:
df.cc = pd.Categorical(df.cc)
Now the data look similar but are stored categorically. To capture the category codes:
df['code'] = df.cc.cat.codes
Now you have:
   cc  temp  code
0  US  37.0     2
1  CA  12.0     1
2  US  35.0     2
3  AU  20.0     0
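For anyone who wants to run this end to end, here is a minimal sketch. The DataFrame is reconstructed from the output shown above, and the name code_to_label is only illustrative. It also shows how to recover which label each code stands for:

```python
import pandas as pd

# Example data reconstructed from the output shown in the answer (an assumption).
df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

# Store the column as a categorical, then read off the integer codes.
df["cc"] = pd.Categorical(df["cc"])
df["code"] = df["cc"].cat.codes

# Categories are sorted, so code i corresponds to categories[i].
code_to_label = dict(enumerate(df["cc"].cat.categories))

print(df)
print(code_to_label)
```

Keeping the code-to-label mapping around is handy if you need to translate the integers back later.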
If you don't want to modify your DataFrame but simply get the codes:
df.cc.astype('category').cat.codes
Or use the categorical column as an index:
df2 = pd.DataFrame(df.temp)
df2.index = pd.CategoricalIndex(df.cc)
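A small sketch of the index variant, again assuming the same example data as above. The names index_codes and us_rows are just for illustration. The codes are available from the index itself, and label-based lookup still works:

```python
import pandas as pd

# Example data reconstructed from the output shown in the answer (an assumption).
df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

# Keep only the temperatures and index them by the categorical country code.
df2 = pd.DataFrame(df["temp"])
df2.index = pd.CategoricalIndex(df["cc"])

# The integer codes live on the index.
index_codes = df2.index.codes

# Select every row labelled "US".
us_rows = df2.loc["US"]

print(index_codes)
print(us_rows)
```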
If you only wish to transform your series into integer identifiers, you can use pd.factorize.
Note that this solution, unlike pd.Categorical, will not sort alphabetically: codes are assigned in order of first appearance, so the first country will be assigned 0. If you wish to start from 1, you can add a constant:
df['code'] = pd.factorize(df['cc'])[0] + 1
print(df)
   cc  temp  code
0  US  37.0     1
1  CA  12.0     2
2  US  35.0     1
3  AU  20.0     3
If you wish to sort alphabetically, specify sort=True:
df['code'] = pd.factorize(df['cc'], sort=True)[0] + 1
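pd.factorize returns two things, the codes and the unique values, and the snippets above only keep the first. Here is a rough sketch, assuming the same example data as in the outputs above, that compares the two orderings and uses the uniques to map codes back to labels:

```python
import pandas as pd

# Example data reconstructed from the output shown in the answer (an assumption).
df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

# Default: codes follow order of first appearance (US, CA, AU).
codes, uniques = pd.factorize(df["cc"])

# sort=True: codes follow the sorted unique values (AU, CA, US).
sorted_codes, sorted_uniques = pd.factorize(df["cc"], sort=True)

# Indexing the uniques with the codes restores the original labels.
restored = list(uniques[codes])

print(codes + 1)         # shifted to start from 1
print(sorted_codes + 1)
print(restored)
```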
If you are using the sklearn library, you can use LabelEncoder. Like pd.Categorical, input strings are sorted alphabetically before encoding.
from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()
df['code'] = LE.fit_transform(df['cc'])
print(df)
   cc  temp  code
0  US  37.0     2
1  CA  12.0     1
2  US  35.0     2
3  AU  20.0     0
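One thing the fitted encoder gives you is a way back to the labels. Here is a minimal sketch, assuming scikit-learn is installed and using the same example data as above. The names le and decoded are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Example data reconstructed from the output shown in the answer (an assumption).
df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

le = LabelEncoder()
df["code"] = le.fit_transform(df["cc"])

# classes_ holds the sorted labels; position i is the label for code i.
labels = le.classes_.tolist()

# inverse_transform maps the integer codes back to the original strings.
decoded = le.inverse_transform(df["code"]).tolist()

print(labels)
print(decoded)
```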