Converting categorical values to binary using pandas
You mean "one-hot" encoding?
Say you have the following dataset:
    import pandas as pd

    df = pd.DataFrame([
        ['green', 1, 10.1, 0],
        ['red', 2, 13.5, 1],
        ['blue', 3, 15.3, 0]])
    df.columns = ['color', 'size', 'prize', 'class label']
    df
Now, you have multiple options ...
A) The Tedious Approach
    color_mapping = {
        'green': (0, 0, 1),
        'red': (0, 1, 0),
        'blue': (1, 0, 0)}
    df['color'] = df['color'].map(color_mapping)
    df
    import numpy as np

    y = df['class label'].values
    X = df.iloc[:, :-1].values
    # Flatten each row: unpack the color tuple, then append the remaining features
    X = np.apply_along_axis(
        func1d=lambda x: np.array(list(x[0]) + list(x[1:])),
        axis=1, arr=X)

    print('Class labels:', y)
    print('\nFeatures:\n', X)
Yielding:
    Class labels: [0 1 0]

    Features:
     [[  0.    0.    1.    1.   10.1]
     [  0.    1.    0.    2.   13.5]
     [  1.    0.    0.    3.   15.3]]
B) Scikit-learn's DictVectorizer
    from sklearn.feature_extraction import DictVectorizer

    # Starts from the original df, with 'color' still holding strings
    dvec = DictVectorizer(sparse=False)
    X = dvec.fit_transform(df.transpose().to_dict().values())
    X
Yielding:
    array([[  0. ,   0. ,   1. ,   0. ,  10.1,   1. ],
           [  1. ,   0. ,   0. ,   1. ,  13.5,   2. ],
           [  0. ,   1. ,   0. ,   0. ,  15.3,   3. ]])
C) Pandas' get_dummies
pd.get_dummies(df)
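To make option C a bit more concrete, here is a minimal sketch using the same toy dataset. The `columns=` and `dtype=int` arguments are my additions, not part of the call above: `columns` restricts the encoding to the string column, and `dtype=int` asks for 0/1 indicators, since recent pandas versions return booleans by default.

```python
import pandas as pd

df = pd.DataFrame(
    [["green", 1, 10.1, 0], ["red", 2, 13.5, 1], ["blue", 3, 15.3, 0]],
    columns=["color", "size", "prize", "class label"],
)

# Only 'color' is expanded; the numeric columns pass through unchanged
# and come first, followed by one indicator column per category.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)

print(encoded.columns.tolist())
# ['size', 'prize', 'class label', 'color_blue', 'color_green', 'color_red']
```

Unlike the DictVectorizer route, the result is still a DataFrame, so the column names travel with the data.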
It seems that you are using scikit-learn's DictVectorizer to convert the categorical values to binary. In that case, to store the result along with the new column names, you can construct a new DataFrame with values from vec_x and columns from DV.get_feature_names() (renamed to get_feature_names_out() in scikit-learn 1.0 and later). Then store the DataFrame to disk (e.g. with to_csv()) instead of the NumPy array.
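A rough sketch of that idea follows. The names `DV` and `vec_x` mirror the ones in your question; the `records` data and the `encoded.csv` filename are made up for illustration, and the `hasattr` check is only there because the method name depends on the installed scikit-learn version.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical input: one dict per sample
records = [
    {"color": "green", "size": 1},
    {"color": "red", "size": 2},
    {"color": "blue", "size": 3},
]

DV = DictVectorizer(sparse=False)
vec_x = DV.fit_transform(records)

# Newer scikit-learn exposes get_feature_names_out(); older releases
# only have get_feature_names().
if hasattr(DV, "get_feature_names_out"):
    names = list(DV.get_feature_names_out())
else:
    names = list(DV.get_feature_names())

# Wrap the array in a DataFrame so the column names are saved with the data
encoded = pd.DataFrame(vec_x, columns=names)
encoded.to_csv("encoded.csv", index=False)
```

Reading the file back with `pd.read_csv("encoded.csv")` should then give you the encoded values together with headers such as `color=blue`.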
Alternatively, it is also possible to use pandas to do the encoding directly with the get_dummies function:
    import pandas as pd

    data = pd.DataFrame({'T': ['A', 'B', 'C', 'D', 'E']})
    # dtype=int gives 0/1 columns; pandas 2.0+ returns booleans by default
    res = pd.get_dummies(data, dtype=int)
    res.to_csv('output.csv')
    print(res)
Output:
       T_A  T_B  T_C  T_D  T_E
    0    1    0    0    0    0
    1    0    1    0    0    0
    2    0    0    1    0    0
    3    0    0    0    1    0
    4    0    0    0    0    1