
Using explicit (predefined) validation set for grid search with sklearn


Use PredefinedSplit:

    ps = PredefinedSplit(test_fold=your_test_fold)

Then pass cv=ps to GridSearchCV.

test_fold : array-like, shape (n_samples,)

test_fold[i] gives the test set fold of sample i. A value of -1 indicates that the corresponding sample is not part of any test set folds, but will instead always be put into the training fold.
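To make the meaning of test_fold concrete, here is a minimal sketch that prints the splits PredefinedSplit produces. The six-sample fold assignment is a made-up illustration, not taken from the question.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Hypothetical fold assignment for six samples:
# -1 -> never in a test fold (always training),
#  0 -> test fold 0, 1 -> test fold 1.
test_fold = np.array([-1, -1, 0, 0, 1, 1])
ps = PredefinedSplit(test_fold=test_fold)

# One (train, test) pair is produced per distinct non-negative fold value.
splits = [(train.tolist(), test.tolist()) for train, test in ps.split()]
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

The samples marked -1 appear in the training indices of every split and never in a test fold.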


When using a validation set, set test_fold to 0 for all samples that are part of the validation set and to -1 for all other samples (the training samples).
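Put together, the recipe above might look like the following sketch. It assumes the training and validation sets already exist as separate arrays. The synthetic data, the 80/20 split, the SVR estimator and the small grid are placeholders chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.svm import SVR

# Placeholder data standing in for your own train and validation sets.
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=0)
X_train, y_train = X[:80], y[:80]
X_val, y_val = X[80:], y[80:]

# GridSearchCV takes a single dataset, so stack train and validation together.
X_all = np.concatenate([X_train, X_val])
y_all = np.concatenate([y_train, y_val])

# -1 for every training sample, 0 for every validation sample.
test_fold = np.concatenate([
    np.full(len(X_train), -1),
    np.zeros(len(X_val), dtype=int),
])

ps = PredefinedSplit(test_fold=test_fold)
search = GridSearchCV(SVR(), param_grid={"C": [1, 10, 100]}, cv=ps)
search.fit(X_all, y_all)
print(search.best_params_)
```

Note that GridSearchCV defaults to refit=True, so after the search the best estimator is refit on everything passed to fit, which here means training plus validation data. Pass refit=False if the final model should only ever see the training samples.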


Consider using the hypopt Python package (pip install hypopt), of which I am an author. It's a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out of the box and can also be used with TensorFlow, PyTorch, Caffe2, etc.

    # Code from https://github.com/cgnorthcutt/hypopt
    # Assuming you already have train, test, val sets and a model.
    from hypopt import GridSearch
    from sklearn.svm import SVR

    param_grid = [
      {'C': [1, 10, 100], 'kernel': ['linear']},
      {'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
    ]

    # Grid-search all parameter combinations using a validation set.
    opt = GridSearch(model=SVR(), param_grid=param_grid)
    opt.fit(X_train, y_train, X_val, y_val)
    print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))

EDIT: I (think I) received -1's on this response because I'm suggesting a package that I authored. This is unfortunate, given that the package was created specifically to solve this type of problem.


    # Import libraries
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.model_selection import PredefinedSplit

    # Split data into train and validation sets
    # (X is assumed to be a pandas DataFrame; the lookup below relies on X.index)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, train_size=0.8, stratify=y, random_state=2020)

    # Create a list where train data indices are -1 and validation data indices are 0
    split_index = [-1 if x in X_train.index else 0 for x in X.index]

    # Use the list to create PredefinedSplit
    pds = PredefinedSplit(test_fold=split_index)

    # Use PredefinedSplit in GridSearchCV
    # (estimator and param_grid are assumed to be defined already)
    clf = GridSearchCV(estimator=estimator,
                       cv=pds,
                       param_grid=param_grid)

    # Fit with all data
    clf.fit(X, y)
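The snippet above builds split_index from X.index, so it relies on X being a pandas object. If X is a plain NumPy array, one possible variant is to split an array of positions instead of the data itself. This is only a sketch: the synthetic dataset, LogisticRegression and the small grid are placeholders, not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split

# Placeholder data: X and y are plain NumPy arrays here.
X, y = make_classification(n_samples=200, random_state=0)

# Split row positions rather than the rows themselves, keeping the stratification.
positions = np.arange(len(X))
train_pos, val_pos = train_test_split(
    positions, train_size=0.8, stratify=y, random_state=2020)

# Start with every sample marked as training (-1), then mark validation rows with 0.
test_fold = np.full(len(X), -1)
test_fold[val_pos] = 0

pds = PredefinedSplit(test_fold=test_fold)
clf = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=pds,
)

# Fit with all data; the split decides which rows are used for validation.
clf.fit(X, y)
print(clf.best_params_)
```

Because test_fold is filled by position, this works the same way for arrays and for DataFrames whose index is not unique.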