Concatenate encoded columns to original data frame using Scikit-learn and Pandas


There are a couple of ways to do this. Assuming you want to encode the independent variables, you can use pd.get_dummies with drop_first=True, which drops the first category of each encoded column so the dummies are not perfectly collinear. Here is an example:

    import pandas as pd

    # Create a DataFrame of independent variables X for the example
    X = pd.DataFrame({'Country': ['China', 'India', 'USA', 'Indonesia', 'Brasil'],
                      'Continent': ['Asia', 'Asia', 'North America', 'Asia', 'South America'],
                      'Population, M': [1403.5, 1324.2, 322.2, 261.1, 207.6]})
    print(X)

    # Encode every string (object) column
    columnsToEncode = X.select_dtypes(include=[object]).columns
    X = pd.get_dummies(X, columns=columnsToEncode, drop_first=True)
    print(X)

    # X prior to encoding
           Continent    Country  Population, M
    0           Asia      China         1403.5
    1           Asia      India         1324.2
    2  North America        USA          322.2
    3           Asia  Indonesia          261.1
    4  South America     Brasil          207.6

    # X after encoding
    # (newer pandas versions may order the columns differently and
    # print True/False instead of 1/0 unless dtype=int is passed)
       Population, M  Continent_North America  Continent_South America  \
    0         1403.5                        0                        0
    1         1324.2                        0                        0
    2          322.2                        1                        0
    3          261.1                        0                        0
    4          207.6                        0                        1

       Country_China  Country_India  Country_Indonesia  Country_USA
    0              1              0                  0            0
    1              0              1                  0            0
    2              0              0                  0            1
    3              0              0                  1            0
    4              0              0                  0            0
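If the original categorical columns should stay in the frame next to their encoded versions, one option is to build the dummies separately and join them back with pd.concat. This is only a minimal sketch reusing the example data above. The names `dummies` and `X_with_dummies` are made up for illustration, the columns to encode are listed by hand, and dtype=int is passed only to get 0/1 rather than booleans on newer pandas.

```python
import pandas as pd

X = pd.DataFrame({'Country': ['China', 'India', 'USA', 'Indonesia', 'Brasil'],
                  'Continent': ['Asia', 'Asia', 'North America', 'Asia', 'South America'],
                  'Population, M': [1403.5, 1324.2, 322.2, 261.1, 207.6]})

# Columns to encode, listed explicitly for this example
columnsToEncode = ['Country', 'Continent']

# Encode a copy of just those columns; X itself is left untouched
dummies = pd.get_dummies(X[columnsToEncode], columns=columnsToEncode,
                         drop_first=True, dtype=int)

# Concatenate column-wise; both frames share the same index, so rows line up
X_with_dummies = pd.concat([X, dummies], axis=1)
print(X_with_dummies.columns.tolist())
```

Because pd.concat aligns on the index, this works as long as the dummies are built from the same frame (or a frame with the same index) as the one they are joined to.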


If I understand correctly, you want to encode the columns and get the result back as a DataFrame. One way of doing this:

Convert your df into a NumPy array (as_matrix() was removed in pandas 1.0, so use to_numpy() instead):

    df_array = df[['A', 'B', 'C']].to_numpy()

Perform the encoding:

    from sklearn import preprocessing

    le = preprocessing.LabelEncoder()
    # Label-encode each column of the array in place
    for i in range(df_array.shape[1]):
        df_array[:, i] = le.fit_transform(df_array[:, i])

For the OneHotEncoder:

    from sklearn.preprocessing import OneHotEncoder

    enc = OneHotEncoder()
    enc.fit(df_array)
    OHE_array = enc.transform(df_array).toarray()

However, one-hot encoding can increase the dimensionality considerably, since every category becomes its own column. You may therefore want to apply PCA or another dimensionality reduction technique so that downstream algorithms stay computationally feasible.
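As a rough illustration of that point, the sketch below one-hot encodes a small made-up frame and then projects the result onto two principal components with scikit-learn's PCA. The data and the choice of n_components=2 are assumptions picked only for the example. Whether PCA suits one-hot data at all depends on the downstream model.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical data, for illustration only
df = pd.DataFrame({'A': ['a', 'b', 'c', 'a', 'b', 'c'],
                   'B': ['x', 'y', 'x', 'z', 'y', 'x'],
                   'C': ['u', 'u', 'v', 'w', 'v', 'w']})

# One column per category: 3 + 3 + 3 columns in total
enc = OneHotEncoder()
OHE_array = enc.fit_transform(df[['A', 'B', 'C']]).toarray()
print(OHE_array.shape)

# Reduce the one-hot matrix to two components (an arbitrary choice here)
pca = PCA(n_components=2)
reduced = pca.fit_transform(OHE_array)
print(reduced.shape)
```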

If you want it back in the dataframe format:

    newdf = pd.DataFrame(df_array, columns=['A', 'B', 'C'])