Can sklearn random forest directly handle categorical features?

python scikit-learn random-forest one-hot-encoding

No, there isn't. Somebody's working on this and the patch might be merged into mainline some day, but right now there's no support for categorical variables in scikit-learn except dummy (one-hot) encoding.

python scikit-learn random-forest one-hot-encoding

Most implementations of random forest (and many other machine learning algorithms) that accept categorical inputs are either just automating the encoding of categorical features for you or using a method that becomes computationally intractable for large numbers of categories.

A notable exception is H2O. H2O has a very efficient method for handling categorical data directly which often gives it an edge over tree based methods that require one-hot-encoding.

This article by Will McGinnis has a very good discussion of one-hot-encoding and alternatives.

This article by Nick Dingwall and Chris Potts has a very good discussion about categorical variables and tree based learners.

python scikit-learn random-forest one-hot-encoding

You have to make the categorical variable into a series of dummy variables. Yes I know its annoying and seems unnecessary but that is how sklearn works. if you are using pandas. use pd.get_dummies, it works really well.

CodeHunter

Can sklearn random forest directly handle categorical features?

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last