Using Scikit-Learn OneHotEncoder with a Pandas DataFrame


OneHotEncoder encodes categorical features as a one-hot numeric array. Its transform method returns a sparse matrix if sparse=True (the default; the parameter is named sparse_output from scikit-learn 1.2 onward), otherwise it returns a 2-D array.

You can't cast a 2-D array (or sparse matrix) into a single pandas Series. You must create a pandas Series (a column in a pandas DataFrame) for each category.

I would recommend pandas.get_dummies instead:

    data = pd.get_dummies(data, prefix=['Profession'], columns=['Profession'], drop_first=True)
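As a quick illustration of what that call produces, here is a small sketch on made-up data (the 'Age' column and the values are assumptions, not from the question):

```python
import pandas as pd

# Hypothetical toy data for illustration.
data = pd.DataFrame({
    "Profession": ["doctor", "nurse", "teacher", "doctor"],
    "Age": [41, 29, 35, 52],
})

# Same call as above: encode 'Profession' and drop the first category
# ('doctor', the first in sorted order) to avoid a redundant column.
encoded = pd.get_dummies(
    data, prefix=["Profession"], columns=["Profession"], drop_first=True
)
print(encoded.columns.tolist())
```

Keep in mind that get_dummies does not remember the categories it saw, so if you later encode new data with different categories, the resulting columns can differ. That is one reason to prefer a fitted OneHotEncoder when you have separate train and test sets.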

EDIT:

Using Sklearn OneHotEncoder:

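    # jobs_encoder is assumed to be a OneHotEncoder already fitted on the 'Profession' column
    transformed = jobs_encoder.transform(data['Profession'].to_numpy().reshape(-1, 1))
    # transform() returns a sparse matrix by default; convert it to a dense array
    if hasattr(transformed, 'toarray'):
        transformed = transformed.toarray()
    # Create a pandas DataFrame of the one-hot encoded columns, keeping the original index
    # (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
    ohe_df = pd.DataFrame(transformed, columns=jobs_encoder.get_feature_names_out(), index=data.index)
    # Concatenate with the original data and drop the source column
    data = pd.concat([data, ohe_df], axis=1).drop(['Profession'], axis=1)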

Other options: if you are doing hyperparameter tuning with GridSearchCV, it's recommended to use ColumnTransformer and FeatureUnion with Pipeline, or make_column_transformer directly.
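A rough sketch of the pipeline route could look like the following. The toy data, the LogisticRegression model and the parameter grid are all assumptions picked for illustration; swap in your own estimator and columns.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; only 'Profession' comes from the question.
X = pd.DataFrame({
    "Profession": ["doctor", "nurse", "teacher", "doctor",
                   "nurse", "teacher", "doctor", "nurse"],
    "Age": [41, 29, 35, 52, 33, 47, 38, 26],
})
y = [1, 0, 0, 1, 0, 1, 1, 0]

# One-hot encode 'Profession' and pass the remaining columns through unchanged.
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Profession"]),
    remainder="passthrough",
)

# The encoder is refitted inside every cross-validation fold,
# so no information leaks from the validation split.
pipeline = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name.
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(X, y)
print(search.best_params_)
```

With this setup you never build the encoded DataFrame by hand; the fitted search object accepts the raw DataFrame in predict() as well.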


It turned out that scikit-learn's LabelBinarizer gave me better luck in converting the data to one-hot encoded format. With help from Amnie's solution, my final code is as follows:

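    import pandas as pd
    from sklearn.preprocessing import LabelBinarizer

    # Fit the binarizer on the 'Profession' column and encode it
    jobs_encoder = LabelBinarizer()
    jobs_encoder.fit(data['Profession'])
    transformed = jobs_encoder.transform(data['Profession'])
    # Keep the original index so that concat aligns the rows correctly
    ohe_df = pd.DataFrame(transformed, index=data.index)
    # Append the encoded columns and drop the source column
    data = pd.concat([data, ohe_df], axis=1).drop(['Profession'], axis=1)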
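If you want readable column names with this approach, one possible variation (a sketch on made-up data, not part of my original code) is to label the columns with the binarizer's classes_ attribute:

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy data with three distinct professions.
data = pd.DataFrame({"Profession": ["doctor", "nurse", "teacher", "doctor"]})

jobs_encoder = LabelBinarizer()
transformed = jobs_encoder.fit_transform(data["Profession"])

# classes_ holds the sorted labels, one per output column
# (this holds when there are three or more classes).
ohe_df = pd.DataFrame(transformed, columns=jobs_encoder.classes_, index=data.index)

result = pd.concat([data, ohe_df], axis=1).drop(columns=["Profession"])
print(result)
```

Note that with exactly two classes LabelBinarizer returns a single 0/1 column, so the classes_ labels would not match the output width in that case.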


Below is an approach suggested by Kaggle Learn. I do not think there is a simpler way at the moment to go from an original pandas DataFrame to a one-hot encoded DataFrame.

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    # X_train, X_valid, low_cardinality_cols and object_cols come from the Kaggle Learn exercise

    # Apply one-hot encoder to each column with categorical data
    # (sparse_output was named sparse before scikit-learn 1.2)
    OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
    OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

    # One-hot encoding removed index; put it back
    OH_cols_train.index = X_train.index
    OH_cols_valid.index = X_valid.index

    # Remove categorical columns (will replace with one-hot encoding)
    num_X_train = X_train.drop(object_cols, axis=1)
    num_X_valid = X_valid.drop(object_cols, axis=1)

    # Add one-hot encoded columns to numerical features
    OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
    OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

    print(OH_X_train)