One hot encoding of string categorical features
If you are on sklearn>0.20.dev0
```
In [11]: import numpy as np
    ...: from sklearn.preprocessing import OneHotEncoder
    ...: cat = OneHotEncoder()
    ...: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T
    ...: cat.fit_transform(X).toarray()
Out[11]:
array([[1., 0., 0., 1., 0.],
       [0., 1., 0., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 0., 1.]])
```
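To sanity-check what the encoder learned, recent sklearn versions expose the fitted categories via `categories_` and can map one-hot rows back with `inverse_transform`. A small sketch reusing the array above:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T
cat = OneHotEncoder()
encoded = cat.fit_transform(X).toarray()

# One array of learned categories per input column.
print(cat.categories_)
# Map one-hot rows back to the original labels.
print(cat.inverse_transform(encoded))
```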
If you are on sklearn==0.20.dev0
```
In [29]: import numpy as np
    ...: from sklearn.preprocessing import CategoricalEncoder

In [30]: cat = CategoricalEncoder()

In [31]: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T

In [32]: cat.fit_transform(X).toarray()
Out[32]:
array([[ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  1.]])
```
Another way to do it is to use category_encoders.
Here is an example:
```
% pip install category_encoders

import numpy as np
import category_encoders as ce

le = ce.OneHotEncoder(return_df=False, impute_missing=False, handle_unknown="ignore")
X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])
le.fit_transform(X)

array([[1, 0, 1, 0, 1, 0],
       [0, 1, 0, 1, 0, 1]])
```
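If you'd rather stay inside sklearn, its own OneHotEncoder (>= 0.20) offers the same protection against unseen categories via `handle_unknown='ignore'`: unknown values simply encode to an all-zero block instead of raising. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']], dtype=object))

# 'fish' never appeared in column 1 during fit, so its block is all zeros.
unseen = np.array([['a', 'fish', 'red']], dtype=object)
print(enc.transform(unseen).toarray())  # [[1. 0. 0. 0. 0. 1.]]
```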
Very nice question.
However, in some sense, it is a special case of something that comes up (at least for me) rather often: given sklearn stages applicable to subsets of the columns of the X matrix, you'd like to apply one (or possibly several) of them to the entire matrix. Here, for example, you have a stage which knows how to run on a single column, and you'd like to apply it three times, once per column.
This is a classic case for using the Composite Design Pattern.
Here is a (sketch of a) reusable stage that accepts a dictionary mapping a column index into the transformation to apply to it:
```
class ColumnApplier(object):
    def __init__(self, column_stages):
        self._column_stages = column_stages

    def fit(self, X, y):
        for i, k in self._column_stages.items():
            k.fit(X[:, i])
        return self

    def transform(self, X):
        X = X.copy()
        for i, k in self._column_stages.items():
            X[:, i] = k.transform(X[:, i])
        return X
```
Now, to use it in this context, starting with
```
import numpy as np

X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])
y = np.array([1, 2])
X
```
you would just use it to map each column index to the transformation you want:
```
from sklearn import preprocessing

multi_encoder = \
    ColumnApplier(dict([(i, preprocessing.LabelEncoder()) for i in range(3)]))
multi_encoder.fit(X, None).transform(X)
```
Once you develop such a stage (I can't post the one I use), you can use it over and over for various settings.
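For what it's worth, sklearn >= 0.20 ships a built-in composite of exactly this shape, `ColumnTransformer`, which maps column indices to transformers. A minimal sketch on the same X:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']], dtype=object)

# One-hot encode columns 0, 1 and 2; unlisted columns are dropped by default.
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0, 1, 2])])
res = ct.fit_transform(X)
# Depending on output density the result may come back sparse; normalize to dense.
res = res.toarray() if hasattr(res, 'toarray') else res
print(res)
```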
I've faced this problem many times, and I found a solution in this book, on page 100:
We can apply both transformations (from text categories to integer categories, then from integer categories to one-hot vectors) in one shot using the LabelBinarizer class:
and the sample code is here:

```
from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()
# data: a 1-D array of text categories
housing_cat_1hot = encoder.fit_transform(data)
housing_cat_1hot
```
As a result, you get the one-hot encoded matrix. Note that this returns a dense NumPy array by default; you can get a sparse matrix instead by passing sparse_output=True to the LabelBinarizer constructor.
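A small sketch of that note (the labels array here is made up for illustration): both calls produce the same encoding, and only the container type differs.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

labels = np.array(['red', 'green', 'blue', 'green'])

# Default: a dense NumPy array, one column per class.
dense = LabelBinarizer().fit_transform(labels)

# sparse_output=True: a scipy.sparse matrix with identical contents.
sparse = LabelBinarizer(sparse_output=True).fit_transform(labels)
print(dense)
```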
You can find more about LabelBinarizer in the sklearn official documentation.