Converting categorical values to binary using pandas


You mean "one-hot" encoding?

Say you have the following dataset:

    import pandas as pd

    df = pd.DataFrame([
        ['green', 1, 10.1, 0],
        ['red', 2, 13.5, 1],
        ['blue', 3, 15.3, 0]])
    df.columns = ['color', 'size', 'prize', 'class label']
    df


Now, you have multiple options ...

A) The Tedious Approach

    color_mapping = {
        'green': (0, 0, 1),
        'red': (0, 1, 0),
        'blue': (1, 0, 0)}

    df['color'] = df['color'].map(color_mapping)
    df


    import numpy as np

    y = df['class label'].values
    X = df.iloc[:, :-1].values
    # Unpack the tuple in the first column and append the remaining features
    X = np.apply_along_axis(
        func1d=lambda x: np.array(list(x[0]) + list(x[1:])),
        axis=1, arr=X)

    print('Class labels:', y)
    print('\nFeatures:\n', X)

Yielding:

    Class labels: [0 1 0]

    Features:
     [[  0.    0.    1.    1.   10.1]
     [  0.    1.    0.    2.   13.5]
     [  1.    0.    0.    3.   15.3]]

B) Scikit-learn's DictVectorizer

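    from sklearn.feature_extraction import DictVectorizer

    # Note: df here is the original DataFrame with string colors,
    # not the tuple-mapped version from approach A
    dvec = DictVectorizer(sparse=False)
    X = dvec.fit_transform(df.transpose().to_dict().values())
    X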

Yielding:

    array([[  0. ,   0. ,   1. ,   0. ,  10.1,   1. ],
           [  1. ,   0. ,   0. ,   1. ,  13.5,   2. ],
           [  0. ,   1. ,   0. ,   0. ,  15.3,   3. ]])

C) Pandas' get_dummies

    pd.get_dummies(df)

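
For completeness, a small sketch of option C that also separates features from labels, starting from the original DataFrame (before the mapping in A). Passing `columns=['color']` and `dtype=int` is a choice made here, not part of the answer above: it restricts the encoding to one column and yields 0/1 rather than booleans on newer pandas versions.

```python
import pandas as pd

df = pd.DataFrame([
    ['green', 1, 10.1, 0],
    ['red', 2, 13.5, 1],
    ['blue', 3, 15.3, 0]],
    columns=['color', 'size', 'prize', 'class label'])

# Encode only the 'color' column; the other columns are left untouched
encoded = pd.get_dummies(df, columns=['color'], dtype=int)
print(encoded)

# Split into label vector and feature matrix
y = encoded['class label'].values
X = encoded.drop(columns='class label').values
```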


It seems that you are using scikit-learn's DictVectorizer to convert the categorical values to binary. In that case, to store the result along with the new column names, you can construct a new DataFrame with the values from vec_x and the columns from DV.get_feature_names() (renamed to get_feature_names_out() in scikit-learn 1.0, with the old name removed in 1.2). Then store the DataFrame to disk (e.g. with to_csv()) instead of the NumPy array.
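
A rough sketch of that idea follows. The `rows` data is a hypothetical stand-in, since the asker's input is not shown; `DV` and `vec_x` follow the names used in the question, while `get_names`, `encoded`, and `csv_text` are illustrative.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical stand-in for the asker's data
rows = [{'T': 'A', 'n': 1}, {'T': 'B', 'n': 2}, {'T': 'A', 'n': 3}]

DV = DictVectorizer(sparse=False)
vec_x = DV.fit_transform(rows)

# Use get_feature_names_out() where it exists (scikit-learn >= 1.0),
# otherwise fall back to the older get_feature_names()
get_names = getattr(DV, 'get_feature_names_out', None) or DV.get_feature_names
encoded = pd.DataFrame(vec_x, columns=list(get_names()))

# to_csv() returns the CSV text when no path is given;
# pass a filename instead to write to disk
csv_text = encoded.to_csv(index=False)
print(csv_text)
```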

Alternatively, it is also possible to use pandas to do the encoding directly with the get_dummies function:

    import pandas as pd

    data = pd.DataFrame({'T': ['A', 'B', 'C', 'D', 'E']})
    # dtype=int keeps the 0/1 output; pandas 2.0+ returns booleans by default
    res = pd.get_dummies(data, dtype=int)
    res.to_csv('output.csv')
    print(res)

Output:

       T_A  T_B  T_C  T_D  T_E
    0    1    0    0    0    0
    1    0    1    0    0    0
    2    0    0    1    0    0
    3    0    0    0    1    0
    4    0    0    0    0    1