Concatenate encoded columns to original data frame using Scikit-learn and Pandas
There are a couple of ways to do this. Assuming you want to encode the independent variables, you can use pd.get_dummies with drop_first=True, which drops the first level of each categorical column so that the dummy columns are not perfectly collinear. Here is an example:
    import pandas as pd

    # Create a data frame of independent variables X for the example
    X = pd.DataFrame({'Country': ['China', 'India', 'USA', 'Indonesia', 'Brasil'],
                      'Continent': ['Asia', 'Asia', 'North America', 'Asia', 'South America'],
                      'Population, M': [1403.5, 1324.2, 322.2, 261.1, 207.6]})
    print(X)

    # Encode
    columnsToEncode = X.select_dtypes(include=[object]).columns
    X = pd.get_dummies(X, columns=columnsToEncode, drop_first=True)
    print(X)

    # X prior to encoding
           Continent    Country  Population, M
    0           Asia      China         1403.5
    1           Asia      India         1324.2
    2  North America        USA          322.2
    3           Asia  Indonesia          261.1
    4  South America     Brasil          207.6

    # X after encoding
       Population, M  Continent_North America  Continent_South America  \
    0         1403.5                        0                        0
    1         1324.2                        0                        0
    2          322.2                        1                        0
    3          261.1                        0                        0
    4          207.6                        0                        1

       Country_China  Country_India  Country_Indonesia  Country_USA
    0              1              0                  0            0
    1              0              1                  0            0
    2              0              0                  0            1
    3              0              0                  1            0
    4              0              0                  0            0
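Since the question is about keeping the encoded columns next to the original ones, here is a minimal sketch of one way that could be done with pd.concat. Note that pd.get_dummies with the columns argument replaces the encoded columns, so this sketch encodes a copy of them and appends the result instead. The helper name add_dummies and the explicit cat_cols list are assumptions made for illustration, not part of the answer above.

```python
import pandas as pd

def add_dummies(df, cat_cols, drop_first=True):
    """Return a copy of df with dummy columns appended; original columns are kept.

    add_dummies is a hypothetical helper name used only for this sketch.
    """
    dummies = pd.get_dummies(df[cat_cols], drop_first=drop_first)
    # axis=1 concatenates column-wise; rows are aligned on the shared index
    return pd.concat([df, dummies], axis=1)

X = pd.DataFrame({'Country': ['China', 'India', 'USA', 'Indonesia', 'Brasil'],
                  'Continent': ['Asia', 'Asia', 'North America', 'Asia', 'South America'],
                  'Population, M': [1403.5, 1324.2, 322.2, 261.1, 207.6]})

X_with_dummies = add_dummies(X, cat_cols=['Country', 'Continent'])
print(X_with_dummies.columns.tolist())
```

Depending on the pandas version, the dummy columns may hold 0/1 integers or True/False booleans; both behave the same in arithmetic.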
If I understand correctly, you want to encode the columns and then get them back in data frame format. One way of doing this could be:
Convert your df into a matrix.
    # as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
    df_array = df[['A', 'B', 'C']].to_numpy()
Perform the encoding:
    from sklearn import preprocessing

    le = preprocessing.LabelEncoder()
    # Loop over the columns of df_array (not df, which may have more columns)
    for i in range(df_array.shape[1]):
        df_array[:, i] = le.fit_transform(df_array[:, i])
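One thing to be aware of with the loop above is that a single LabelEncoder is refitted on every column, so only the mapping for the last column is retained. If you need to map codes back to the original labels later, a possible variation is to keep one encoder per column. The sketch below assumes a small made-up data frame with string columns A, B and C; the encoders dictionary is an illustrative name.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up example data, for illustration only
df = pd.DataFrame({'A': ['x', 'y', 'x'],
                   'B': ['u', 'u', 'v'],
                   'C': ['p', 'q', 'r']})

encoders = {}          # one fitted LabelEncoder per column
encoded = df.copy()
for col in ['A', 'B', 'C']:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(df[col])

# Each column's encoder can later recover the original labels
restored_A = encoders['A'].inverse_transform(encoded['A'])
print(encoded)
print(list(restored_A))
```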
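For the OneHotEncoder:
    from sklearn.preprocessing import OneHotEncoder

    enc = OneHotEncoder()
    enc.fit(df_array)
    # transform() returns a sparse matrix by default; toarray() makes it dense
    OHE_array = enc.transform(df_array).toarray()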
However, one-hot encoding can increase the dimensionality considerably, so you may want to apply PCA or another dimensionality reduction technique to keep downstream algorithms computationally feasible.
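As a rough illustration of the PCA suggestion, the sketch below one-hot encodes a small made-up array and projects the result onto two principal components. The data and the choice of n_components=2 are arbitrary assumptions; in practice the number of components would depend on how much variance you want to retain.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical data: 6 rows, 2 columns with 3 categories each
raw = np.array([['a', 'x'], ['b', 'y'], ['c', 'x'],
                ['a', 'z'], ['b', 'z'], ['c', 'y']], dtype=object)

ohe_array = OneHotEncoder().fit_transform(raw).toarray()

# Project the one-hot columns onto 2 components (arbitrary choice here)
pca = PCA(n_components=2)
reduced = pca.fit_transform(ohe_array)

print(ohe_array.shape, reduced.shape)
```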
If you want it back in the dataframe format:
    newdf = pd.DataFrame(df_array, columns=['A', 'B', 'C'])