XGBoost Categorical Variables: Dummification vs encoding

xgboost only deals with numeric columns.

if you have a feature [a,b,b,c] which describes a categorical variable (i.e. no numeric relationship)

Using LabelEncoder you will simply have this:

array([0, 1, 1, 2])

Xgboost will wrongly interpret this feature as having a numeric relationship! This just maps each string ('a','b','c') to an integer, nothing more.

Proper way

Using OneHotEncoder you will eventually get to this:

array([[ 1.,  0.,  0.],       [ 0.,  1.,  0.],       [ 0.,  1.,  0.],       [ 0.,  0.,  1.]])

This is the proper representation of a categorical variable for xgboost or any other machine learning tool.

Pandas get_dummies is a nice tool for creating dummy variables (which is easier to use, in my opinion).

Method #2 in above question will not represent the data properly

python categorical-data xgboost

I want to answer this question not just in terms of XGBoost but in terms of any problem dealing with categorical data. While "dummification" creates a very sparse setup, specially if you have multiple categorical columns with different levels, label encoding is often biased as the mathematical representation is not reflective of the relationship between levels.

For Binary Classification problems, a genius yet unexplored approach which is highly leveraged in traditional credit scoring models is to use Weight of Evidence to replace the categorical levels. Basically every categorical level is replaced by the proportion of Goods/ Proportion of Bads.

Can read more about it here.

Python library here.

This method allows you to capture the "levels" under one column and avoid sparsity or induction of bias that would occur through dummifying or encoding.

Hope this helps !

python categorical-data xgboost

Nov 23, 2020

XGBoost has since version 1.3.0 added experimental support for categorical features. From the docs:

1.8.7 Categorical Data
Other than users performing encoding, XGBoost has experimental supportfor categorical data using gpu_hist and gpu_predictor. No specialoperation needs to be done on input test data since the informationabout categories is encoded into the model during training.

https://buildmedia.readthedocs.org/media/pdf/xgboost/latest/xgboost.pdf

In the DMatrix section the docs also say:

enable_categorical (boolean, optional) – New in version 1.3.0.
Experimental support of specializing for categorical features. Do notset to True unless you are interested in development. Currently it’sonly available for gpu_hist tree method with 1 vs rest (one hot)categorical split. Also, JSON serialization format, gpu_predictor andpandas input are required.

CodeHunter

XGBoost Categorical Variables: Dummification vs encoding

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last