sklearn pipeline - how to apply different transformations on different columns


The way I usually do it is with a FeatureUnion, using a FunctionTransformer to pull out the relevant columns.

Important notes:

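  • Define the column-selection functions with def. A lambda cannot be pickled with the standard pickle module, so a FunctionTransformer that wraps one prevents you from pickling the model. (A functools.partial around a module-level function does pickle in current Python 3, but a plain def is the safest choice.)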

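  • Initialize FunctionTransformer with validate=False so the DataFrame reaches your function unchanged. With validate=True the input is converted to a NumPy array first, and selecting columns by name fails. (Since scikit-learn 0.22, validate=False is the default.)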

Something like this:

    from sklearn.pipeline import make_union, make_pipeline
    from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OrdinalEncoder

    def get_text_cols(df):
        return df[['name', 'fruit']]

    def get_num_cols(df):
        return df[['height', 'age']]

    # OrdinalEncoder replaces the LabelEncoder of the original snippet:
    # LabelEncoder is meant for target labels and fails inside a pipeline.
    vec = make_union(
        make_pipeline(FunctionTransformer(get_text_cols, validate=False), OrdinalEncoder()),
        make_pipeline(FunctionTransformer(get_num_cols, validate=False), MinMaxScaler()),
    )
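To see what the union produces, here is a minimal sketch on a made-up DataFrame. The data values are invented for illustration; the column names follow the snippet above. It assumes scikit-learn 0.20 or later, and it uses OrdinalEncoder in place of LabelEncoder, because LabelEncoder is designed for target labels and does not accept the (X, y) call a pipeline makes.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OrdinalEncoder


def get_text_cols(df):
    # Select the categorical columns by name.
    return df[["name", "fruit"]]


def get_num_cols(df):
    # Select the numeric columns by name.
    return df[["height", "age"]]


# Toy data, invented for this example.
df = pd.DataFrame({
    "name": ["ann", "bob", "cat"],
    "fruit": ["apple", "pear", "apple"],
    "height": [150.0, 180.0, 165.0],
    "age": [20, 40, 30],
})

vec = make_union(
    make_pipeline(FunctionTransformer(get_text_cols, validate=False), OrdinalEncoder()),
    make_pipeline(FunctionTransformer(get_num_cols, validate=False), MinMaxScaler()),
)

# Each branch transforms its own columns; the results are stacked side by side.
features = vec.fit_transform(df)
print(features.shape)
```

The result is a plain NumPy array with one row per sample: two integer-coded columns from the first branch followed by two columns scaled to [0, 1] from the second.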


Since scikit-learn v0.20, you can use ColumnTransformer (in sklearn.compose) to accomplish this.
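As a rough sketch of what that looks like, the following applies one transformer to two categorical columns and another to two numeric columns. The DataFrame and its column names are made up for illustration, and the choice of OneHotEncoder and MinMaxScaler is only an assumption about what you might want per column group.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data, invented for this example.
df = pd.DataFrame({
    "name": ["ann", "bob", "cat"],
    "fruit": ["apple", "pear", "apple"],
    "height": [150.0, 180.0, 165.0],
    "age": [20, 40, 30],
})

# Each entry is (label, transformer, list of columns it should receive).
ct = ColumnTransformer(
    transformers=[
        ("text", OneHotEncoder(handle_unknown="ignore"), ["name", "fruit"]),
        ("num", MinMaxScaler(), ["height", "age"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = ct.fit_transform(df)
print(encoded.shape)
```

Columns are selected by name directly, so no helper functions or FunctionTransformer wrappers are needed.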


An example using ColumnTransformer may help:

    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # COUNTIES_OF_INTEREST, MyW2VTransformer, KerasClassifier and create_model
    # are defined or imported in the full script linked below.

    # FOREGOING TRANSFORMATIONS ON 'data' ...

    # filter data
    data = data[data['county'].isin(COUNTIES_OF_INTEREST)]

    # define the feature encoding of the data
    impute_and_one_hot_encode = Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        # note: scikit-learn 1.2 renamed `sparse` to `sparse_output`
        ('encode', OneHotEncoder(sparse=False, handle_unknown='ignore'))
    ])

    featurisation = ColumnTransformer(transformers=[
        ("impute_and_one_hot_encode", impute_and_one_hot_encode, ['smoker', 'county', 'race']),
        ('word2vec', MyW2VTransformer(min_count=2), ['last_name']),
        ('numeric', StandardScaler(), ['num_children', 'income'])
    ])

    # define the training pipeline for the model
    neural_net = KerasClassifier(build_fn=create_model, epochs=10, batch_size=1, verbose=0, input_dim=109)
    pipeline = Pipeline([
        ('features', featurisation),
        ('learner', neural_net)
    ])

    # train-test split
    train_data, test_data = train_test_split(data, random_state=0)

    # model training
    model = pipeline.fit(train_data, train_data['label'])

You can find the full code at: https://github.com/stefan-grafberger/mlinspect/blob/19ca0d6ae8672249891835190c9e2d9d3c14f28f/example_pipelines/healthcare/healthcare.py
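The example above depends on Keras and a custom word2vec transformer from that repository, so it will not run on its own. Below is a cut-down sketch of the same structure that does: it drops the word2vec step, swaps the neural net for scikit-learn's LogisticRegression, and uses a small invented dataset. All data values are made up; only the column names and pipeline layout follow the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, invented for illustration (one missing 'smoker' value).
data = pd.DataFrame({
    "smoker": ["yes", "no", np.nan, "no", "yes", "no", "yes", "no"],
    "county": ["a", "b", "a", "c", "b", "a", "c", "b"],
    "race": ["r1", "r2", "r1", "r2", "r1", "r2", "r1", "r2"],
    "num_children": [0, 2, 1, 3, 0, 1, 2, 4],
    "income": [30.0, 55.0, 42.0, 61.0, 28.0, 47.0, 39.0, 70.0],
    "label": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Categorical columns: fill missing values, then one-hot encode.
impute_and_one_hot_encode = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

featurisation = ColumnTransformer(transformers=[
    ("impute_and_one_hot_encode", impute_and_one_hot_encode, ["smoker", "county", "race"]),
    ("numeric", StandardScaler(), ["num_children", "income"]),
])

# LogisticRegression stands in for the KerasClassifier of the original.
pipeline = Pipeline([
    ("features", featurisation),
    ("learner", LogisticRegression()),
])

train_data, test_data = train_test_split(data, random_state=0)

# The ColumnTransformer only picks the listed columns, so 'label' is ignored as a feature.
model = pipeline.fit(train_data, train_data["label"])
predictions = model.predict(test_data)
print(predictions)
```

Because the ColumnTransformer selects columns by name, the full DataFrame (including the label column) can be passed to fit and predict without dropping anything first.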