How to properly pickle sklearn pipeline when using custom transformer


I found a pretty straightforward solution. Assuming you are using Jupyter notebooks for training:

  1. Create a .py file where the custom transformer is defined and import it into the Jupyter notebook.

This is the file custom_transformer.py

    from sklearn.base import TransformerMixin


    class FilterOutBigValuesTransformer(TransformerMixin):
        def __init__(self):
            pass

        def fit(self, X, y=None):
            self.biggest_value = X.c1.max()
            return self

        def transform(self, X):
            return X.loc[X.c1 <= self.biggest_value]
  2. Train your model, importing this class from the .py file, and save the fitted pipeline using joblib.
    import joblib
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler

    from custom_transformer import FilterOutBigValuesTransformer

    pipeline = Pipeline([
        ('filter', FilterOutBigValuesTransformer()),
        ('encode', MinMaxScaler()),
    ])

    X = load_some_pandas_dataframe()
    pipeline.fit(X)
    joblib.dump(pipeline, 'pipeline.pkl')
  3. When loading the .pkl file in a different Python script, you will have to import the .py file for the unpickling to work:
    import joblib

    from utils import custom_transformer  # decided to save it in a utils directory

    pipeline = joblib.load('pipeline.pkl')


I have created a workaround solution. I do not consider it a complete answer to my question, but nonetheless it let me move on from my problem.

Conditions for the workaround to work:

I. The pipeline needs to contain only two kinds of transformers:

  1. sklearn transformers
  2. custom transformers, but only with attributes of the following types:
    • number
    • string
    • list
    • dict

or any combination of those, e.g. a list of dicts with strings and numbers. The important thing is that the attributes are JSON-serializable.

II. Names of pipeline steps need to be unique (even if there is pipeline nesting).


In short, the model is stored as a directory with joblib-dumped files, a JSON file for the custom transformers, and a JSON file with other info about the model.

I have created a function that goes through the steps of a pipeline and checks the __module__ attribute of each transformer.

If it finds sklearn in it, it runs joblib.dump under the name specified in the steps (the first element of the step tuple), into a selected model directory.

Otherwise (no sklearn in __module__) it adds the transformer's __dict__ to result_dict under a key equal to the name specified in the steps. At the end I json.dump the result_dict to the model directory under the name result_dict.json.
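A minimal sketch of such a serialization function, assuming a flat (non-nested) pipeline; the function name, file names, and directory layout are illustrative, not part of the original code:

    import json
    import os

    import joblib


    def dump_pipeline(pipeline, model_dir):
        # Save sklearn steps with joblib and collect custom steps' attributes for JSON.
        os.makedirs(model_dir, exist_ok=True)
        result_dict = {}
        for name, transformer in pipeline.steps:
            if 'sklearn' in transformer.__module__:
                # sklearn transformer: dump the fitted object under the step name
                joblib.dump(transformer, os.path.join(model_dir, name + '.joblib'))
            else:
                # custom transformer: store its (JSON-serializable) attributes
                result_dict[name] = transformer.__dict__
        with open(os.path.join(model_dir, 'result_dict.json'), 'w') as f:
            json.dump(result_dict, f)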

If there is a need to go into some transformer, e.g. because there is a Pipeline inside a pipeline, you can probably run this function recursively by adding some rules to the beginning of the function, but then it becomes important to always have unique step/transformer names, even between the main pipeline and sub-pipelines.

If there is other information needed for creating the model pipeline, save it in model_info.json.


Then, if you want to load the model for use: you need to create (without fitting) the same pipeline in the target project. If pipeline creation is somewhat dynamic and you need information from the source project, load it from model_info.json.

You can copy the function used for serialization and:

  • replace all joblib.dump calls with joblib.load, and assign the __dict__ of the loaded object to the __dict__ of the object already in the pipeline
  • replace all places where you added a __dict__ to result_dict with an assignment of the appropriate value from result_dict to the object's __dict__ (remember to load result_dict from the file beforehand)

After running this modified function, the previously unfitted pipeline should have all the transformer attributes that resulted from fitting loaded, and the pipeline as a whole should be ready to predict.
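A matching load-side sketch, under the same assumptions (flat pipeline, identical step names, the hypothetical file layout from the dump sketch above); the unfitted pipeline must already be constructed in the target project:

    import json
    import os

    import joblib


    def load_pipeline_state(pipeline, model_dir):
        # Fill an unfitted pipeline with the state saved by the dump function above.
        with open(os.path.join(model_dir, 'result_dict.json')) as f:
            result_dict = json.load(f)
        for name, transformer in pipeline.steps:
            if 'sklearn' in transformer.__module__:
                # load the fitted sklearn object and copy its state into the step
                fitted = joblib.load(os.path.join(model_dir, name + '.joblib'))
                transformer.__dict__ = fitted.__dict__
            else:
                # restore the custom transformer's attributes from the JSON dump
                transformer.__dict__ = result_dict[name]
        return pipeline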

The main things I do not like about this solution are that it needs the pipeline code inside the target project and that all attributes of the custom transformers must be JSON-serializable, but I leave it here for other people who stumble on a similar problem; maybe somebody comes up with something better.


Based on my research, it seems that the best solution is to create a Python package that includes your trained pipeline and all the files it depends on.

Then you can pip install it in the project where you want to use it and import the pipeline with from <package name> import <pipeline name>.
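As a rough sketch of that approach (the package name, module, and file names below are all hypothetical), the package could bundle the custom transformer code together with the dumped pipeline and expose a small loader:

    # Hypothetical package layout:
    #
    #   my_model_package/
    #       __init__.py            <- the code below
    #       custom_transformer.py  <- the custom transformer class
    #       artifacts/pipeline.pkl <- the joblib-dumped, fitted pipeline
    #
    # my_model_package/__init__.py
    import os

    import joblib

    # make the custom class importable so unpickling can find it
    from .custom_transformer import FilterOutBigValuesTransformer

    _ARTIFACT_PATH = os.path.join(os.path.dirname(__file__), 'artifacts', 'pipeline.pkl')


    def load_pipeline():
        # Load the trained pipeline shipped inside the package.
        return joblib.load(_ARTIFACT_PATH)

After pip-installing such a package, the consuming project would only need something like from my_model_package import load_pipeline to get the fitted pipeline back.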