Use sklearn's GridSearchCV with a pipeline, preprocessing just once

python numpy machine-learning scikit-learn grid-search

Update: Ideally, the answer below should not be used, because it leads to data leakage, as discussed in the comments. In this answer, GridSearchCV tunes the hyperparameters on data that StandardScaler has already fitted and transformed as a whole, so every cross-validation split sees scaling statistics computed partly from its own held-out fold. In most cases that does not matter much, but algorithms that are very sensitive to scaling can give misleading results.

Essentially, GridSearchCV is itself an estimator: it implements fit() and predict(), so it can be used as the final step of a pipeline.

So instead of:

grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)

Do this:

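clf = make_pipeline(StandardScaler(),
                    GridSearchCV(LogisticRegression(),
                                 # the search wraps the bare estimator, so the key is 'C' with no step prefix
                                 param_grid={'C': [0.1, 10.]},
                                 cv=2,
                                 refit=True))

# X, y stand for your own feature matrix and labels
clf.fit(X, y)
clf.predict(X)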
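For anyone who wants to try this end to end, here is a minimal sketch with the imports filled in. The synthetic data from make_classification and the variable names (search, fitted_search, best_C) are my own assumptions and are not part of the original question. Note that the grid key is plain 'C', because the search wraps LogisticRegression directly rather than a pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration (an assumption, not from the question).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    LogisticRegression(),
    param_grid={"C": [0.1, 10.0]},  # no step prefix: the estimator is not inside a pipeline here
    cv=2,
    refit=True,  # needed so the pipeline can predict afterwards
)

# The scaler is fitted once on all of X, then the grid search runs on the scaled data.
clf = make_pipeline(StandardScaler(), search)
clf.fit(X, y)

# make_pipeline names each step after its lowercased class name.
fitted_search = clf.named_steps["gridsearchcv"]
best_C = fitted_search.best_params_["C"]
predictions = clf.predict(X)
```

Keep in mind that the scaler here has seen the validation folds, which is exactly the leakage the update at the top warns about.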

This calls StandardScaler only once per call to clf.fit(), instead of the multiple calls you described.

Edit:

I changed refit to True for the case where GridSearchCV is used inside a pipeline. As mentioned in the documentation:

refit : boolean, default=True Refit the best estimator with the entire dataset. If “False”, it is impossible to make predictions using this GridSearchCV instance after fitting.

If refit=False, clf.fit() leaves nothing usable for prediction, because the GridSearchCV object inside the pipeline does not keep a fitted best estimator after fit(). When refit=True, GridSearchCV is refitted with the best-scoring parameter combination on all the data passed to fit().

So refit=False is appropriate only if you build the pipeline just to look at the scores of the grid search. If you want to call clf.predict(), you must use refit=True; otherwise a Not Fitted error will be raised.
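A small sketch of that difference, using GridSearchCV on its own so the behaviour is easy to see. The synthetic data and variable names are assumptions of mine. The exact exception type seems to depend on the scikit-learn version, so the sketch catches both NotFittedError and AttributeError rather than relying on one.

```python
from sklearn.datasets import make_classification
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data, purely for illustration (an assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    LogisticRegression(),
    param_grid={"C": [0.1, 10.0]},
    cv=2,
    refit=False,  # only score the candidates, do not refit the best one
)
search.fit(X, y)

# The cross-validation scores are still available for inspection.
mean_scores = search.cv_results_["mean_test_score"]

# Without refitting there is no fitted best estimator to predict with.
has_best_estimator = hasattr(search, "best_estimator_")

try:
    search.predict(X)
    predict_failed = False
except (NotFittedError, AttributeError):
    predict_failed = True
```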

For those who ran into a slightly different problem, as I did:

Suppose you have this pipeline:

classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_features=100000, ngram_range=(1, 3))),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=SEED, n_jobs=-1))])

Then, when specifying parameters, you need to prefix each one with the name you gave the estimator step followed by a double underscore, here 'clf__'. So the parameter grid is going to be:

params = {'clf__max_features': [0.3, 0.5, 0.7],
          'clf__min_samples_leaf': [1, 2, 3],
          'clf__max_depth': [None]
          }
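If you are unsure which names a pipeline accepts, one way to check is to compare the grid keys against get_params(). This is only a sketch: the SEED value is a placeholder I picked, since the original does not show how it is defined, and valid_names and unknown are hypothetical helper names.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

SEED = 0  # placeholder value (assumption); not given in the original

classifier = Pipeline([
    ("vectorizer", CountVectorizer(max_features=100000, ngram_range=(1, 3))),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=SEED, n_jobs=-1)),
])

params = {
    "clf__max_features": [0.3, 0.5, 0.7],
    "clf__min_samples_leaf": [1, 2, 3],
    "clf__max_depth": [None],
}

# get_params() lists every tunable name, including the <step>__<parameter> forms.
valid_names = set(classifier.get_params().keys())

# Any key listed here would make GridSearchCV fail with an invalid-parameter error.
unknown = [name for name in params if name not in valid_names]
```

An empty unknown list means every key in the grid matches a parameter the pipeline exposes.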

This is not possible in the current version of scikit-learn (0.18.1). A fix has been proposed on the GitHub project:

https://github.com/scikit-learn/scikit-learn/issues/8830

https://github.com/scikit-learn/scikit-learn/pull/8322
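In newer scikit-learn releases, Pipeline accepts a memory argument that caches fitted transformers, which may be worth trying if the goal is to avoid refitting the preprocessing for every parameter combination. I am not claiming this is exactly what the links above implement. The sketch below keeps the scaler inside the cross-validated pipeline, so there is no leakage; the scaler is still fitted once per fold, but that fit can be reused across the values of C. The data, step names and cache directory are assumptions of mine.

```python
import shutil
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration (an assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Temporary directory used as the transformer cache.
cache_dir = tempfile.mkdtemp()
try:
    pipe = Pipeline(
        [("scaler", StandardScaler()), ("logreg", LogisticRegression())],
        memory=cache_dir,  # fitted transformers are cached here and reused
    )
    # The whole pipeline is cross-validated, so scaling is learned per training fold.
    grid = GridSearchCV(pipe, param_grid={"logreg__C": [0.1, 10.0]}, cv=2)
    grid.fit(X, y)
    best_params = grid.best_params_
finally:
    # Clean up the cache once the search is done.
    shutil.rmtree(cache_dir)
```

For a cheap transformer like StandardScaler the cache overhead may outweigh the gain; it is more likely to pay off with expensive steps such as text vectorizers.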
