scikit-learn GridSearchCV with multiple repetitions
This is called nested cross-validation. You can look at the official documentation example to guide you in the right direction, and also have a look at my other answer here for a similar approach.
You can adapt the steps to suit your need:
```python
from sklearn.svm import SVC
from sklearn.model_selection import KFold, GridSearchCV, cross_val_score

svr = SVC(kernel="rbf")
c_grid = {"C": [1, 10, 100, ... ]}

# CV technique: "LabelKFold", "LeaveOneOut", "LeaveOneLabelOut", etc.

# To be used within GridSearchCV (5 in your case)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=i)

# To be used in the outer CV (you asked for 10)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=i)

# Non-nested parameter search and scoring
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_score = clf.best_score_

# Pass the GridSearchCV estimator to cross_val_score.
# This gives you the required 10 x 5 CV:
# 10 for the outer CV and 5 for GridSearchCV's internal CV.
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv).mean()
```
Edit - Description of nested cross-validation with cross_val_score() and GridSearchCV()
1. `clf = GridSearchCV(estimator, param_grid, cv=inner_cv)`.
2. Pass `clf, X, y, outer_cv` to `cross_val_score`.
3. As seen in the source code of `cross_val_score`, this `X` will be divided into `X_outer_train, X_outer_test` using `outer_cv`. Same for `y`.
4. `X_outer_test` will be held back, and `X_outer_train` will be passed on to `clf` for `fit()` (GridSearchCV in our case). Assume `X_outer_train` is called `X_inner` from here on, since it is passed to the inner estimator, and assume `y_outer_train` is `y_inner`.
5. `X_inner` will now be split into `X_inner_train` and `X_inner_test` using `inner_cv` inside the GridSearchCV. Same for `y`.
6. Now the grid-search estimator will be trained using `X_inner_train` and `y_inner_train`, and scored using `X_inner_test` and `y_inner_test`.
7. Steps 5 and 6 will be repeated for the inner CV iterations (5 in this case).
8. The hyper-parameters for which the average score over all inner iterations `(X_inner_train, X_inner_test)` is best are passed on to `clf.best_estimator_`, which is then fitted on all the data, i.e. `X_outer_train`.
9. This `clf` (`gridsearch.best_estimator_`) will then be scored using `X_outer_test` and `y_outer_test`.
10. Steps 3 to 9 will be repeated for the outer CV iterations (10 here), and an array of scores will be returned from `cross_val_score`.
11. We then use `mean()` to get back `nested_score`.
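The steps above can be sketched as an explicit double loop. This is a simplified equivalent of what `cross_val_score` does when given a `GridSearchCV` estimator; the dataset, grid values, and seeds are illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [1, 10, 100]}  # illustrative grid

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)

outer_scores = []
for train_idx, test_idx in outer_cv.split(X):      # steps 3-4: outer split
    X_inner, y_inner = X[train_idx], y[train_idx]  # X_outer_train, renamed
    X_outer_test, y_outer_test = X[test_idx], y[test_idx]

    # Steps 5-8: the inner grid search picks the best C on X_inner
    # and (with the default refit=True) refits best_estimator_ on all of X_inner.
    gridsearch = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=inner_cv)
    gridsearch.fit(X_inner, y_inner)

    # Step 9: score the refitted best estimator on the held-out outer fold.
    outer_scores.append(gridsearch.score(X_outer_test, y_outer_test))

# Steps 10-11: one score per outer fold, then the mean.
nested_score = np.mean(outer_scores)
```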
You can supply different cross-validation generators to `GridSearchCV`. The default for binary or multiclass classification problems is `StratifiedKFold`; otherwise it uses `KFold`. But you can supply your own. In your case, it looks like you want `RepeatedKFold` or `RepeatedStratifiedKFold`.
```python
from sklearn.model_selection import GridSearchCV, RepeatedKFold

# Define svr here...

# Specify the cross-validation generator, in this case (10 x 5CV)
cv = RepeatedKFold(n_splits=5, n_repeats=10)
clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)

# Continue as usual
clf.fit(...)
```
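As a sanity check (the toy data below is just for counting), `RepeatedKFold(n_splits=5, n_repeats=10)` produces 5 x 10 = 50 train/test splits, so `GridSearchCV` will fit each parameter candidate 50 times:

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(40).reshape(20, 2)  # toy data, only used to count splits
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)

n_splits = cv.get_n_splits(X)  # 5 folds x 10 repeats
splits = list(cv.split(X))
```

Note that this is repeated, not nested, cross-validation: inside the grid search, the 50 scores are averaged per candidate to pick the best parameters, whereas the nested setup above keeps an outer loop purely for scoring.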