One hot encoding of string categorical features
If you are on sklearn>0.20.dev0
```
In [11]: import numpy as np
    ...: from sklearn.preprocessing import OneHotEncoder
    ...: cat = OneHotEncoder()
    ...: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T
    ...: cat.fit_transform(X).toarray()
Out[11]:
array([[1., 0., 0., 1., 0.],
       [0., 1., 0., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 0., 1.]])
```
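To sanity-check what the encoder learned, recent sklearn versions expose the fitted categories via `categories_` and can map one-hot rows back with `inverse_transform`. A small sketch reusing the array above:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T
cat = OneHotEncoder()
encoded = cat.fit_transform(X).toarray()

# One array of learned categories per input column.
print(cat.categories_)
# Map one-hot rows back to the original labels.
print(cat.inverse_transform(encoded))
```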
If you are on sklearn==0.20.dev0
```
In [29]: import numpy as np
    ...: from sklearn.preprocessing import CategoricalEncoder

In [30]: cat = CategoricalEncoder()

In [31]: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T

In [32]: cat.fit_transform(X).toarray()
Out[32]:
array([[ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  1.]])
```
Another way to do it is to use category_encoders.
Here is an example:
```
% pip install category_encoders

import numpy as np
import category_encoders as ce

le = ce.OneHotEncoder(return_df=False, impute_missing=False, handle_unknown="ignore")
X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])
le.fit_transform(X)

array([[1, 0, 1, 0, 1, 0],
       [0, 1, 0, 1, 0, 1]])
```
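If you'd rather stay inside sklearn, its own OneHotEncoder (>= 0.20) offers the same protection against unseen categories via `handle_unknown='ignore'`: unknown values simply encode to an all-zero block instead of raising. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']], dtype=object))

# 'fish' never appeared in column 1 during fit, so its block is all zeros.
unseen = np.array([['a', 'fish', 'red']], dtype=object)
print(enc.transform(unseen).toarray())  # [[1. 0. 0. 0. 0. 1.]]
```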
Very nice question.
However, in some sense, it is a special case of something that comes up (at least for me) rather often: given sklearn stages applicable to subsets of the columns of the X matrix, you'd like to apply one (or possibly several) of them to the entire matrix. Here, for example, you have a stage which knows how to run on a single column, and you'd like to apply it three times, once per column.
This is a classic case for using the Composite Design Pattern.
Here is a (sketch of a) reusable stage that accepts a dictionary mapping a column index into the transformation to apply to it:
```
class ColumnApplier(object):
    def __init__(self, column_stages):
        self._column_stages = column_stages

    def fit(self, X, y):
        for i, k in self._column_stages.items():
            k.fit(X[:, i])
        return self

    def transform(self, X):
        X = X.copy()
        for i, k in self._column_stages.items():
            X[:, i] = k.transform(X[:, i])
        return X
```
Now, to use it in this context, starting with
```
import numpy as np

X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])
y = np.array([1, 2])
X
```
you would just use it to map each column index to the transformation you want:
```
from sklearn import preprocessing

multi_encoder = \
    ColumnApplier(dict([(i, preprocessing.LabelEncoder()) for i in range(3)]))
multi_encoder.fit(X, None).transform(X)
```
Once you develop such a stage (I can't post the one I use), you can use it over and over for various settings.
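For what it's worth, sklearn >= 0.20 ships a built-in composite of exactly this shape, `ColumnTransformer`, which maps column indices to transformers. A minimal sketch on the same X:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']], dtype=object)

# One-hot encode columns 0, 1 and 2; unlisted columns are dropped by default.
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0, 1, 2])])
res = ct.fit_transform(X)
# Depending on output density the result may come back sparse; normalize to dense.
res = res.toarray() if hasattr(res, 'toarray') else res
print(res)
```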
I've faced this problem many times, and I found a solution in this book, on page 100:
We can apply both transformations (from text categories to integer categories, then from integer categories to one-hot vectors) in one shot using the LabelBinarizer class:
and the sample code is here:

```
from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()
# data: a 1-D array of text categories
housing_cat_1hot = encoder.fit_transform(data)
housing_cat_1hot
```
As a result, you get the one-hot encoded matrix. Note that this returns a dense NumPy array by default; you can get a sparse matrix instead by passing sparse_output=True to the LabelBinarizer constructor.
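A small sketch of that note (the labels array here is made up for illustration): both calls produce the same encoding, and only the container type differs.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

labels = np.array(['red', 'green', 'blue', 'green'])

# Default: a dense NumPy array, one column per class.
dense = LabelBinarizer().fit_transform(labels)

# sparse_output=True: a scipy.sparse matrix with identical contents.
sparse = LabelBinarizer(sparse_output=True).fit_transform(labels)
print(dense)
```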
You can find more about LabelBinarizer in the sklearn official documentation.