How can I one hot encode in Python?


Approach 1: You can use pandas' pd.get_dummies.

Example 1:

    import pandas as pd
    s = pd.Series(list('abca'))
    pd.get_dummies(s)
    Out[]:
         a    b    c
    0  1.0  0.0  0.0
    1  0.0  1.0  0.0
    2  0.0  0.0  1.0
    3  1.0  0.0  0.0

Example 2:

The following transforms a given column into one-hot columns. Use the prefix argument to keep the dummy columns distinguishable when you encode more than one column.

    import pandas as pd

    df = pd.DataFrame({
        'A': ['a', 'b', 'a'],
        'B': ['b', 'a', 'c']
    })
    df
    Out[]:
       A  B
    0  a  b
    1  b  a
    2  a  c

    # Get one hot encoding of column B
    one_hot = pd.get_dummies(df['B'])
    # Drop column B as it is now encoded
    df = df.drop('B', axis=1)
    # Join the encoded df
    df = df.join(one_hot)
    df
    Out[]:
       A  a  b  c
    0  a  0  1  0
    1  b  1  0  0
    2  a  0  0  1
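As a minimal sketch of the prefix argument mentioned above, the same column B can be encoded with prefixed names. The frame is the one from Example 2; nothing else is assumed beyond pandas itself.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'a'],
    'B': ['b', 'a', 'c'],
})

# prefix='B' names the dummy columns B_a, B_b, B_c instead of a, b, c,
# so they cannot be confused with dummies coming from another column
one_hot = pd.get_dummies(df['B'], prefix='B')
df = df.drop('B', axis=1).join(one_hot)

print(df.columns.tolist())  # ['A', 'B_a', 'B_b', 'B_c']
```

Without a prefix, encoding both A and B this way would produce two columns named a and two named b, which is why the prefix matters once more than one column is involved.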

Approach 2: Use Scikit-learn

Using a OneHotEncoder has the advantage of being able to fit on some training data and then transform on some other data using the same instance. We also have handle_unknown to further control what the encoder does with unseen data.

Given a dataset with three features and four samples, we let the encoder find the maximum value per feature and transform the data to a binary one-hot encoding.

    >>> from sklearn.preprocessing import OneHotEncoder
    >>> enc = OneHotEncoder()
    >>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
    OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)
    >>> enc.n_values_
    array([2, 3, 4])
    >>> enc.feature_indices_
    array([0, 2, 5, 9], dtype=int32)
    >>> enc.transform([[0, 1, 1]]).toarray()
    array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

Here is the link for this example: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
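The transcript above reflects an older scikit-learn release; the n_values_ and feature_indices_ attributes are not available in more recent versions, which expose the learned categories as categories_ instead. The following is a small sketch of the fit-then-transform workflow and of handle_unknown='ignore' described earlier. The colour data is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy training and test data (made up for illustration)
X_train = np.array([['red'], ['green'], ['blue']])
X_test = np.array([['green'], ['purple']])  # 'purple' was never seen during fit

# handle_unknown='ignore' encodes unseen categories as an all-zero row
# instead of raising an error at transform time
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train)

print(enc.categories_)                  # categories learned per feature
print(enc.transform(X_test).toarray())  # sparse result converted to dense
```

The encoder is fitted once on the training data and then reused, so the test data gets exactly the same columns in the same order.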


It is much easier to use pandas for basic one-hot encoding. If you're looking for more options, you can use scikit-learn.

For basic one-hot encoding with Pandas you pass your data frame into the get_dummies function.

For example, if I have a dataframe called imdb_movies:

...and I want to one-hot encode the Rated column, I do this:

pd.get_dummies(imdb_movies.Rated)

This returns a new dataframe with a column for every "level" of rating that exists, along with either a 1 or 0 specifying the presence of that rating for a given observation.
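Whether the indicators come back as 1/0 or as True/False depends on the pandas version, since recent releases default to booleans. A hedged sketch that requests integers explicitly is shown below; the movies frame is a hypothetical stand-in for imdb_movies, with invented titles.

```python
import pandas as pd

# Hypothetical stand-in for the imdb_movies frame
movies = pd.DataFrame({
    'Title': ['Movie 1', 'Movie 2', 'Movie 3'],
    'Rated': ['PG', 'R', 'PG'],
})

# dtype=int asks for integer 0/1 indicators instead of the library default
rated_dummies = pd.get_dummies(movies['Rated'], dtype=int)
print(rated_dummies)
```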

Usually, we want this to be part of the original dataframe. In this case, we attach our new dummy-coded frame onto the original frame using "column-binding".

We can column-bind by using the pandas concat function:

    rated_dummies = pd.get_dummies(imdb_movies.Rated)
    pd.concat([imdb_movies, rated_dummies], axis=1)

We can now run an analysis on our full dataframe.

SIMPLE UTILITY FUNCTION

I would recommend making yourself a utility function to do this quickly:

    def encode_and_bind(original_dataframe, feature_to_encode):
        dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
        res = pd.concat([original_dataframe, dummies], axis=1)
        return res

Usage:

encode_and_bind(imdb_movies, 'Rated')


Also, as per @pmalbu's comment, if you would like the function to remove the original feature_to_encode column, use this version:

    def encode_and_bind(original_dataframe, feature_to_encode):
        dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
        res = pd.concat([original_dataframe, dummies], axis=1)
        res = res.drop([feature_to_encode], axis=1)
        return res

You can encode multiple features at the same time as follows:

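    features_to_encode = ['feature_1', 'feature_2', 'feature_3',
                          'feature_4']
    res = train_set
    for feature in features_to_encode:
        # Pass the running result back in so every feature ends up encoded,
        # not just the last one
        res = encode_and_bind(res, feature)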
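As an alternative sketch to the loop, get_dummies can encode several columns in one call through its columns argument. The train_set frame and its column names below are placeholders invented for illustration.

```python
import pandas as pd

# Hypothetical training frame; the column names are placeholders
train_set = pd.DataFrame({
    'feature_1': ['x', 'y', 'x'],
    'feature_2': ['u', 'u', 'v'],
    'target': [1, 0, 1],
})

# columns= selects which columns to encode; the originals are replaced
# by prefixed dummy columns and the remaining columns are kept unchanged
res = pd.get_dummies(train_set, columns=['feature_1', 'feature_2'])

print(res.columns.tolist())
```

This behaves like the second version of encode_and_bind, in that the original feature columns are dropped from the result.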


You can also do it with numpy.eye and NumPy's array indexing:

    import numpy as np

    nb_classes = 6
    data = [[2, 3, 4, 0]]

    def indices_to_one_hot(data, nb_classes):
        """Convert an iterable of indices to one-hot encoded labels."""
        targets = np.array(data).reshape(-1)
        return np.eye(nb_classes)[targets]

The return value of indices_to_one_hot(data, nb_classes) is now

    array([[ 0.,  0.,  1.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  1.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  1.,  0.],
           [ 1.,  0.,  0.,  0.,  0.,  0.]])

The .reshape(-1) is there to flatten the labels into a 1-D array whatever shape they arrive in (you might also be given [[2], [3], [4], [0]]).
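A short sketch of that point follows: the function is repeated so the snippet runs on its own, and it is called with both label layouts. The argmax round trip at the end is an addition for illustration, not part of the original answer.

```python
import numpy as np

def indices_to_one_hot(data, nb_classes):
    """Convert an iterable of indices to one-hot encoded labels."""
    targets = np.array(data).reshape(-1)
    return np.eye(nb_classes)[targets]

# A single row of labels and a column of labels give the same result
row_labels = [[2, 3, 4, 0]]
col_labels = [[2], [3], [4], [0]]
one_hot_row = indices_to_one_hot(row_labels, 6)
one_hot_col = indices_to_one_hot(col_labels, 6)

print(one_hot_row.shape)                         # (4, 6)
print(np.array_equal(one_hot_row, one_hot_col))  # True
print(one_hot_row.argmax(axis=1))                # recovers the labels 2, 3, 4, 0
```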