
Impute categorical missing values in scikit-learn


To use mean values for numeric columns and the most frequent value for non-numeric columns, you could do something like this. You could further distinguish between integers and floats; I guess it might make sense to use the median for integer columns instead (a variant is sketched after the output below).

import pandas as pd
import numpy as np
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        Columns of dtype object are imputed with the most frequent value
        in column.

        Columns of other types are imputed with mean of column.
        """

    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

data = [
    ['a', 1, 2],
    ['b', 1, 1],
    ['b', 2, 2],
    [np.nan, np.nan, np.nan]
]

X = pd.DataFrame(data)
xt = DataFrameImputer().fit_transform(X)

print('before...')
print(X)
print('after...')
print(xt)

which prints,

before...
     0   1   2
0    a   1   2
1    b   1   1
2    b   2   2
3  NaN NaN NaN
after...
   0         1         2
0  a  1.000000  2.000000
1  b  1.000000  1.000000
2  b  2.000000  2.000000
3  b  1.333333  1.666667
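If you want the median for integer columns, as suggested above, a minimal sketch of a variant could look like this. The three-way dtype check and the class name are my own additions, not part of the original answer:

class DataFrameMedianImputer(DataFrameImputer):
    """Variant: most frequent for objects, median for integers, mean for floats."""
    def fit(self, X, y=None):
        def fill_value(c):
            if X[c].dtype == np.dtype('O'):
                return X[c].value_counts().index[0]   # most frequent value
            if pd.api.types.is_integer_dtype(X[c]):
                return X[c].median()                  # median for integer columns
            return X[c].mean()                        # mean for float columns
        self.fill = pd.Series([fill_value(c) for c in X], index=X.columns)
        return self

Keep in mind that pandas stores a numeric column as float as soon as it contains NaN, so the integer branch only matters for pandas' nullable integer dtypes such as Int64.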


You can use sklearn_pandas.CategoricalImputer for the categorical columns. Details:

First (an approach from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow), you can have subpipelines for numerical and string/categorical features, where each subpipeline's first transformer is a selector that takes a list of column names (and full_pipeline.fit_transform() takes a pandas DataFrame):

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

You can then combine these subpipelines with sklearn.pipeline.FeatureUnion, for example:

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline)
])

Now, in the num_pipeline you can simply use sklearn.preprocessing.Imputer() (superseded by sklearn.impute.SimpleImputer in scikit-learn 0.20 and later), and in the cat_pipeline you can use CategoricalImputer() from the sklearn_pandas package.
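Putting it together, a minimal sketch might look like the following. The column names in num_attribs and cat_attribs are placeholders for your own DataFrame's columns, DataFrameSelector is the class defined above, and this has not been tested against every sklearn-pandas version; on scikit-learn 0.22+ use sklearn.impute.SimpleImputer instead of Imputer:

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import Imputer  # sklearn.impute.SimpleImputer in newer versions
from sklearn_pandas import CategoricalImputer

num_attribs = ['age', 'fare']   # placeholder numeric column names
cat_attribs = ['sex']           # placeholder categorical column name

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy='median')),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('imputer', CategoricalImputer()),
])

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

prepared = full_pipeline.fit_transform(df)  # df is your pandas DataFrame

In practice you would typically add an encoder (e.g. one-hot) after the imputer in cat_pipeline, since FeatureUnion just stacks the imputed string column next to the numeric ones.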

Note: the sklearn-pandas package can be installed with pip install sklearn-pandas, but it is imported as import sklearn_pandas.


There is a package, sklearn-pandas, which has an option for imputing categorical variables: https://github.com/scikit-learn-contrib/sklearn-pandas#categoricalimputer

>>> import numpy as np
>>> from sklearn_pandas import CategoricalImputer
>>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
>>> imputer = CategoricalImputer()
>>> imputer.fit_transform(data)
array(['a', 'b', 'b', 'b'], dtype=object)
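Note that newer releases of sklearn-pandas have dropped CategoricalImputer; on scikit-learn 0.20+ you can get the same behaviour from sklearn.impute.SimpleImputer with strategy='most_frequent', which also accepts string data. A minimal sketch (SimpleImputer expects 2D input, hence the reshape):

>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> data = np.array(['a', 'b', 'b', np.nan], dtype=object).reshape(-1, 1)
>>> SimpleImputer(strategy='most_frequent').fit_transform(data)
array([['a'],
       ['b'],
       ['b'],
       ['b']], dtype=object)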