Passing categorical data to Sklearn Decision Tree


(This is just a reformat of my comment above from 2016...it still holds true.)

The accepted answer for this question is misleading.

As it stands, sklearn decision trees do not handle categorical data - see issue #5442.

The recommended approach of using label encoding converts the categories to integers, which DecisionTreeClassifier() will treat as numeric. If your categorical data is not ordinal, this is not good: the tree can only split on thresholds over the arbitrary integer codes, so you'll end up with splits that do not make sense.
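To make the problem concrete, here is a minimal sketch on made-up toy data (the `colors` column and the target are assumptions for illustration, not from the question). The target depends only on the "middle" category in alphabetical order, so a single threshold on the integer codes cannot isolate it, while a single split on a one-hot column can.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical nominal feature: the class is 1 only for "green".
colors = pd.Series(["blue", "green", "red"] * 4)
target = (colors == "green").astype(int)

# LabelEncoder assigns codes in sorted order: blue=0, green=1, red=2.
codes = LabelEncoder().fit_transform(colors)
X_int = codes.reshape(-1, 1)

# A depth-1 tree gets one threshold on the code, so it cannot
# separate "green" from both "blue" and "red" at once.
stump_int = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_int, target)
acc_int = stump_int.score(X_int, target)

# With one column per category, one split on the "green" column is enough.
X_hot = pd.get_dummies(colors)
stump_hot = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_hot, target)
acc_hot = stump_hot.score(X_hot, target)

print(acc_int, acc_hot)
```

A deeper tree can still recover the right answer from the integer codes, but only by stacking several threshold splits whose shape depends on the arbitrary code order.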

Using a OneHotEncoder is the only current valid way. It allows arbitrary splits that do not depend on the label ordering, but it is computationally expensive.
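One way this could look with sklearn's own encoder is sketched below. The column names (`city`, `size`, `label`) and the data are hypothetical, and wrapping the encoder in a ColumnTransformer and a pipeline is just one reasonable arrangement, not something the issue prescribes. The setting handle_unknown="ignore" is my choice here so that categories unseen during fitting are encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: one nominal column, one numeric column.
df = pd.DataFrame({
    "city": ["paris", "rome", "oslo", "rome", "paris", "oslo"],
    "size": [1.0, 2.5, 0.5, 3.0, 1.5, 0.7],
    "label": ["n", "y", "n", "y", "n", "n"],
})

# One-hot encode only the nominal column; pass the numeric one through.
encode = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["city"])],
    remainder="passthrough",
)

# The encoder is fitted together with the tree, so the same encoding
# is applied at prediction time.
model = make_pipeline(encode, DecisionTreeClassifier(random_state=0))
model.fit(df[["city", "size"]], df["label"])

# "lima" was not seen during fit; it becomes an all-zero one-hot row.
pred = model.predict(pd.DataFrame({"city": ["lima"], "size": [2.8]}))

# 3 one-hot columns + 1 numeric column reach the tree.
print(model[-1].n_features_in_)
```

The cost mentioned above shows up in the feature count: each nominal column expands to one column per distinct category, which grows quickly for high-cardinality features.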


(..)

Able to handle both numerical and categorical data.

This only means that you can use

  • the DecisionTreeClassifier class for classification problems
  • the DecisionTreeRegressor class for regression.

In any case you need to one-hot encode categorical variables before you fit a tree with sklearn, like so:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    data = pd.DataFrame()
    data['A'] = ['a','a','b','a']
    data['B'] = ['b','b','a','b']
    data['C'] = [0, 0, 1, 0]
    data['Class'] = ['n','n','y','n']

    tree = DecisionTreeClassifier()

    one_hot_data = pd.get_dummies(data[['A','B','C']], drop_first=True)
    tree.fit(one_hot_data, data['Class'])


For nominal categorical variables, I would not use LabelEncoder but sklearn.preprocessing.OneHotEncoder or pandas.get_dummies instead, because there is usually no order in this type of variable.