sklearn mask for onehotencoder does not work

python numpy scikit-learn transformation one-hot-encoding

Keep in mind that estimators in Scikit-Learn are designed to work with numerical inputs only. From that point of view there is no reason to leave the text column as it is: you have to either transform it into something numerical or drop it.

If your dataset comes from a Pandas DataFrame, take a look at this small wrapper: https://github.com/paulgb/sklearn-pandas. It lets you transform all the columns you need in one step (or leave some of the columns in numerical form).

    import pandas as pd
    import numpy as np
    from sklearn_pandas import DataFrameMapper
    from sklearn.preprocessing import OneHotEncoder

    data = pd.DataFrame({'text':['aaa', 'bbb'], 'number_1':[1, 1], 'number_2':[2, 2]})
    #    number_1  number_2 text
    # 0         1         2  aaa
    # 1         1         2  bbb

    # SomeEncoder here must be any encoder which will help you to get
    # numerical representation from text column
    mapper = DataFrameMapper([
        ('text', SomeEncoder),
        (['number_1', 'number_2'], OneHotEncoder())
    ])
    mapper.fit_transform(data)
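As an alternative to the sklearn-pandas wrapper, scikit-learn itself provides `ColumnTransformer`. The following is only a minimal sketch, assuming scikit-learn 0.20 or later (where `OneHotEncoder` accepts string columns directly). The transformer name `text_onehot` is made up for illustration and is not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({'text': ['aaa', 'bbb'], 'number_1': [1, 1], 'number_2': [2, 2]})

# One-hot encode only the text column; pass the numeric columns through untouched.
# 'text_onehot' is an arbitrary label chosen for this sketch.
ct = ColumnTransformer(
    [('text_onehot', OneHotEncoder(), ['text'])],
    remainder='passthrough',
)
out = ct.fit_transform(data)

# The result may be sparse or dense depending on its density, so normalise it.
encoded = out.toarray() if hasattr(out, 'toarray') else np.asarray(out)
print(encoded)
```

The encoded text columns come first, followed by the passed-through numeric columns in their original order.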

I think there's some confusion here. You still need to pass in numerical values, but within the encoder you can specify which columns are categorical and which are not. As the OneHotEncoder documentation puts it:

"The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features."

So in the example below I change aaa to 5 and bbb to 6. This way they are easy to tell apart from the numerical values 1 and 2:

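    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    d = np.array([[5, 1, 1], [6, 2, 2]])
    # categorical_features is a boolean mask: only the first column is one-hot encoded
    # (this parameter exists only in older scikit-learn releases)
    ohe = OneHotEncoder(categorical_features=np.array([True, False, False], dtype=bool))
    ohe.fit(d)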

Now you can check your feature categories:

    ohe.active_features_
    Out[22]: array([5, 6], dtype=int64)
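Replacing aaa and bbb with integers by hand, as described above, does not scale to real data. One possible way to derive the integer codes automatically is `np.unique` with `return_inverse=True`. This is only a sketch: the sample values and the variable names `text`, `numbers`, `categories` and `codes` are assumptions made for illustration, and the codes start at 0 rather than at 5.

```python
import numpy as np

# Hypothetical text column and numeric columns, for illustration only.
text = np.array(['aaa', 'bbb', 'aaa'])
numbers = np.array([[1, 2], [1, 2], [3, 4]])

# categories holds the sorted unique labels; codes holds, for each row,
# the index of its label in categories.
categories, codes = np.unique(text, return_inverse=True)

# Put the integer codes in the first column, next to the numeric columns.
d = np.column_stack([codes, numbers])
print(categories)
print(d)
```

The resulting all-integer matrix `d` has the shape the encoder expects, with the categorical column in position 0.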

I encountered the same behavior and found it frustrating. As others have pointed out, Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.

Specifically, the column selection is handled by the _transform_selected() function in /sklearn/preprocessing/data.py, and the very first line of that function is

    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)

This check runs on the whole input, before any columns are selected, and it fails if any of the data in the provided dataframe X cannot be converted to a float.
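The failure can be reproduced without scikit-learn. The sketch below only mimics the float conversion step with plain NumPy. It does not call `check_array` itself, and the helper `can_convert_to_float` is a hypothetical name introduced for this illustration.

```python
import numpy as np

# A mixed array with one text column and two numeric columns.
X = np.array([['aaa', 1, 2], ['bbb', 1, 2]], dtype=object)

def can_convert_to_float(arr):
    """Return True if every element of arr can be cast to float64."""
    try:
        np.asarray(arr, dtype=np.float64)
        return True
    except ValueError:
        return False

# The full array contains strings, so the conversion is rejected.
print(can_convert_to_float(X))
# The numeric columns on their own convert without problems.
print(can_convert_to_float(X[:, 1:]))
```

Because the conversion is attempted on the full array, a single text column is enough to make it fail, regardless of which columns the mask selects.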

I agree that the documentation of sklearn.preprocessing.OneHotEncoder is rather misleading in that regard.
