Impute categorical missing values in scikit-learn
To use mean values for numeric columns and the most frequent value for non-numeric columns, you could do something like this. You could further distinguish between integers and floats; it might make sense to use the median for integer columns instead.
```python
import pandas as pd
import numpy as np
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        Columns of dtype object are imputed with the most frequent value
        in column.

        Columns of other types are imputed with mean of column.
        """

    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean()
            for c in X], index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

data = [
    ['a', 1, 2],
    ['b', 1, 1],
    ['b', 2, 2],
    [np.nan, np.nan, np.nan]
]

X = pd.DataFrame(data)
xt = DataFrameImputer().fit_transform(X)

print('before...')
print(X)
print('after...')
print(xt)
```
which prints,
```
before...
     0    1    2
0    a    1    2
1    b    1    1
2    b    2    2
3  NaN  NaN  NaN
after...
   0         1         2
0  a  1.000000  2.000000
1  b  1.000000  1.000000
2  b  2.000000  2.000000
3  b  1.333333  1.666667
```
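As suggested above, a variant that uses the median instead of the mean for the non-object columns could look like the following sketch (the class name `DataFrameMedianImputer` is my own, not from the original answer):

```python
import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin

class DataFrameMedianImputer(TransformerMixin):
    """Most frequent value for object columns, median for all others."""

    def fit(self, X, y=None):
        # Per-column fill values: mode for object dtype, median otherwise.
        self.fill = pd.Series(
            [X[c].value_counts().index[0] if X[c].dtype == np.dtype('O')
             else X[c].median() for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

data = [['a', 1, 2], ['b', 1, 1], ['b', 2, 2], [np.nan, np.nan, np.nan]]
X = pd.DataFrame(data)
xt = DataFrameMedianImputer().fit_transform(X)
print(xt)
```

The median (here 1.0 and 2.0 for the two numeric columns) keeps the imputed values on the same scale as the observed ones, which is often preferable for integer-like columns.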
You can use sklearn_pandas.CategoricalImputer for the categorical columns. Details:
First (from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow), you can have subpipelines for numerical and string/categorical features, where each subpipeline's first transformer is a selector that takes a list of column names (and full_pipeline.fit_transform() takes a pandas DataFrame):
```python
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
```
You can then combine these subpipelines with sklearn.pipeline.FeatureUnion, for example:
```python
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline)
])
```
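A runnable end-to-end sketch of this approach, assuming hypothetical column names (`age`, `color`) and using sklearn.impute.SimpleImputer (the replacement for the deprecated sklearn.preprocessing.Imputer in recent scikit-learn versions; its strategy='most_frequent' plays the role of CategoricalImputer here):

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns as a 2-D array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

# Hypothetical column names, for illustration only.
num_attribs = ['age']
cat_attribs = ['color']

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', SimpleImputer(strategy='mean')),
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('imputer', SimpleImputer(strategy='most_frequent')),
])

full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
])

df = pd.DataFrame({'age': [1.0, np.nan, 3.0],
                   'color': np.array(['r', 'r', np.nan], dtype=object)})
result = full_pipeline.fit_transform(df)
print(result)  # missing age -> mean (2.0), missing color -> mode ('r')
```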
Now, in the num_pipeline you can simply use sklearn.preprocessing.Imputer(), but in the cat_pipeline you can use CategoricalImputer() from the sklearn_pandas package.
Note: the sklearn-pandas package can be installed with pip install sklearn-pandas, but it is imported as import sklearn_pandas.
There is a package sklearn-pandas which has an option for imputation of categorical variables: https://github.com/scikit-learn-contrib/sklearn-pandas#categoricalimputer
```python
>>> from sklearn_pandas import CategoricalImputer
>>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
>>> imputer = CategoricalImputer()
>>> imputer.fit_transform(data)
array(['a', 'b', 'b', 'b'], dtype=object)
```
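If you'd rather avoid the extra dependency, scikit-learn's own SimpleImputer (available since version 0.20) handles string data with strategy='most_frequent'; note it expects a 2-D array rather than the 1-D array used above:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# SimpleImputer works on 2-D input, so each sample is a one-element row.
data = np.array([['a'], ['b'], ['b'], [np.nan]], dtype=object)
imputer = SimpleImputer(strategy='most_frequent')
out = imputer.fit_transform(data)
print(out)  # the missing value is replaced by the mode, 'b'
```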