How do I resolve one-hot encoding if my test data has missing values in a column?


Guys, please don't make this mistake!

Yes, you can use the hack of concatenating train and test and fool yourself, but the real problem is in production. There, your model will someday face an unknown level of your categorical variable and break.
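To make the failure concrete, here is a minimal sketch. The column name `color` and the toy values are made up for illustration. The columns are fixed by the training data, so a level that only shows up later has no column to go into. Reindexing onto the training columns keeps the shape stable, but the unseen level silently turns into an all-zero row.

```python
import pandas as pd

# Hypothetical toy data: "purple" never appears during training.
train = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
prod = pd.DataFrame({"color": ["green", "purple"]})

# Encode the training data and remember the columns the model was fit on.
train_enc = pd.get_dummies(train, columns=["color"])
train_cols = train_enc.columns

# Encoding production data on its own yields a different set of columns.
prod_raw = pd.get_dummies(prod, columns=["color"])
print(list(train_cols))        # ['color_blue', 'color_green', 'color_red']
print(list(prod_raw.columns))  # ['color_green', 'color_purple']

# Align to the training columns: columns missing in production are filled
# with 0, and columns unseen during training (color_purple) are dropped.
prod_enc = prod_raw.reindex(columns=train_cols, fill_value=0)
print(prod_enc.astype(int).to_numpy())
```

The second production row ends up as all zeros: the model does not crash, but it also gets no information about the new level, which is why the options below are worth considering.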

In reality, some of the more viable options could be:

  1. Retrain your model periodically to account for new data.
  2. Do not use one-hot encoding. Seriously, there are many better options, such as leave-one-out encoding (https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154), conditional probability encoding (https://medium.com/airbnb-engineering/designing-machine-learning-models-7d0048249e69), and target encoding, to name a few. Some classifiers, like CatBoost, even have a built-in mechanism for encoding categorical features, and there are mature Python libraries such as category_encoders where you will find lots of other options.
  3. Embed the categorical features; this could save you from a complete retrain (http://flovv.github.io/Embeddings_with_keras/).
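As a rough illustration of option 2, here is a minimal sketch of plain smoothed target (mean) encoding using only the standard library. The function names `fit_target_encoding` and `transform`, the `smoothing` parameter, and the toy data are all hypothetical. This is not the exact method from the linked posts or from any library, and it is not leave-one-out. The point is only that an unseen level gets a sensible fallback value instead of breaking the pipeline.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=1.0):
    """Return (level -> smoothed target mean, global target mean)."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(categories, targets):
        sums[level] += y
        counts[level] += 1
    # Shrink each level's mean toward the global mean; rare levels shrink more.
    encoding = {
        level: (sums[level] + smoothing * global_mean) / (counts[level] + smoothing)
        for level in counts
    }
    return encoding, global_mean


def transform(categories, encoding, global_mean):
    # Levels never seen during fitting fall back to the global mean.
    return [encoding.get(level, global_mean) for level in categories]


# Hypothetical training data: one categorical feature and a binary target.
train_levels = ["a", "a", "b", "b", "b", "c"]
train_target = [1, 0, 1, 1, 0, 1]
encoding, global_mean = fit_target_encoding(train_levels, train_target)

# "z" was never seen in training, yet it still gets a numeric value.
encoded = transform(["a", "c", "z"], encoding, global_mean)
print(encoded)
```

In practice the encoding should be fit on training folds only, since computing it on the same rows the model trains on leaks the target. Leave-one-out encoding is one way to address that.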


You can first concatenate the two dataframes, apply get_dummies, and then split them again so that both end up with exactly the same columns, i.e.:

    import numpy as np
    import pandas as pd

    # Example dataframes
    Xtrain = pd.DataFrame({'x': np.array([4, 2, 3, 5, 3, 1])})
    Xtest = pd.DataFrame({'x': np.array([4, 5, 1, 3])})

    # Concatenate with keys, then get dummies.
    # dtype=np.uint8 keeps the 0/1 output shown below (newer pandas defaults to bool).
    temp = pd.get_dummies(pd.concat([Xtrain, Xtest], keys=[0, 1]),
                          columns=['x'], dtype=np.uint8)

    # Select each part back out of the MultiIndex and reassign
    Xtrain, Xtest = temp.xs(0), temp.xs(1)

    # Xtrain.to_numpy()  (as_matrix() was removed in pandas 1.0)
    # array([[0, 0, 0, 1, 0],
    #        [0, 1, 0, 0, 0],
    #        [0, 0, 1, 0, 0],
    #        [0, 0, 0, 0, 1],
    #        [0, 0, 1, 0, 0],
    #        [1, 0, 0, 0, 0]], dtype=uint8)

    # Xtest.to_numpy()
    # array([[0, 0, 0, 1, 0],
    #        [0, 0, 0, 0, 1],
    #        [1, 0, 0, 0, 0],
    #        [0, 0, 1, 0, 0]], dtype=uint8)

Do not follow this approach. It's a simple trick with a lot of disadvantages. @Vast Academician's answer explains it better.