Passing categorical data to Sklearn Decision Tree

(This is just a reformat of my comment above from 2016...it still holds true.)

The accepted answer for this question is misleading.

As it stands, sklearn decision trees do not handle categorical data - see issue #5442.

The recommended approach, label encoding, converts the categories to integers, which DecisionTreeClassifier() will treat as numeric. If your categorical data is not ordinal, this is a problem: you'll end up with splits that do not make sense.
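To make the ordering problem concrete, here is a minimal sketch with a made-up `colors` column (the data and variable names are hypothetical, not from the question). It only shows which integers LabelEncoder assigns; the comments spell out what a threshold split can and cannot do with them.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical nominal feature: there is no natural order among colors.
colors = ["red", "green", "blue", "green"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# LabelEncoder sorts the classes, so the integer codes follow alphabetical order.
mapping = {str(label): index for index, label in enumerate(encoder.classes_)}
print(mapping)
print(list(codes))

# A tree splits on thresholds such as `code <= 0.5` or `code <= 1.5`.
# With this encoding a single split can separate blue from {green, red},
# or {blue, green} from red, but never green from {blue, red}.
```

Isolating the middle category would take two consecutive splits, and which category ends up in the middle is an accident of alphabetical order.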

Using a OneHotEncoder is currently the only valid way: it allows arbitrary splits that do not depend on the label ordering, but it is computationally expensive.


(..)
Able to handle both numerical and categorical data.

This only means that you can use the DecisionTreeClassifier class for classification problems and the DecisionTreeRegressor class for regression.

In any case you need to one-hot encode categorical variables before you fit a tree with sklearn, like so:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    data = pd.DataFrame()
    data['A'] = ['a','a','b','a']
    data['B'] = ['b','b','a','b']
    data['C'] = [0, 0, 1, 0]
    data['Class'] = ['n','n','y','n']

    tree = DecisionTreeClassifier()

    one_hot_data = pd.get_dummies(data[['A','B','C']], drop_first=True)
    tree.fit(one_hot_data, data['Class'])


For nominal categorical variables, I would not use LabelEncoder but sklearn.preprocessing.OneHotEncoder or pandas.get_dummies instead, because there is usually no order in these types of variables.
