Using explicit (predefined) validation set for grid search with sklearn

Question:

I have a dataset, which has previously been split into 3 sets: train, validation and test. These sets have to be used as given in order to compare the performance across different algorithms.

I would now like to tune the parameters of my SVM using the validation set, but I cannot find how to pass the validation set explicitly to sklearn.model_selection.GridSearchCV(). Below is some code I've previously used for K-fold cross-validation on the training set; for this problem, however, I need to use the validation set as given. How can I do that?

from sklearn import svm
from sklearn.model_selection import StratifiedKFold, GridSearchCV

# (some code left out to simplify things)

skf = StratifiedKFold(n_splits=5, shuffle=True)
clf = GridSearchCV(svm.SVC(tol=0.005, cache_size=6000,
                           class_weight=penalty_weights),
                   param_grid=tuned_parameters,
                   n_jobs=2,
                   pre_dispatch="n_jobs",
                   cv=skf,
                   scoring=scorer)
clf.fit(X_train, y_train)
Asked By: pir


Answers:

Use PredefinedSplit:

from sklearn.model_selection import PredefinedSplit

ps = PredefinedSplit(test_fold=your_test_fold)

Then pass cv=ps to GridSearchCV. From the PredefinedSplit documentation:

test_fold : array-like, shape (n_samples,)

test_fold[i] gives the test set fold of sample i. A value of -1 indicates that the corresponding sample is not part of any test set folds, but will instead always be put into the training fold.

When using a validation set, set test_fold to 0 for all samples that are part of the validation set, and to -1 for all other samples.
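
For instance, a minimal sketch of this recipe, assuming X_train, y_train, X_val, y_val are NumPy arrays and tuned_parameters is the grid from the question:

import numpy as np
from sklearn.model_selection import PredefinedSplit, GridSearchCV
from sklearn.svm import SVC

# -1: always kept in the training fold; 0: member of the single validation fold
test_fold = np.concatenate([np.full(len(X_train), -1),
                            np.full(len(X_val), 0)])
ps = PredefinedSplit(test_fold=test_fold)

# Fit on train + validation together; PredefinedSplit tells GridSearchCV
# how to separate them internally.
clf = GridSearchCV(SVC(), param_grid=tuned_parameters, cv=ps)
clf.fit(np.concatenate([X_train, X_val]),
        np.concatenate([y_train, y_val]))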

Answered By: yangjie

Consider using the hypopt Python package (pip install hypopt), of which I am an author. It is a professional package created specifically for parameter optimization with a validation set. It works out of the box with any scikit-learn model and can also be used with TensorFlow, PyTorch, Caffe2, etc.

# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, val sets and a model.
from sklearn.svm import SVR
from hypopt import GridSearch

param_grid = [
    {'C': [1, 10, 100], 'kernel': ['linear']},
    {'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model=SVR(), param_grid=param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))

EDIT: I (think I) received -1’s on this response because I’m suggesting a package that I authored. This is unfortunate, given that the package was created specifically to solve this type of problem.

Answered By: cgnorthcutt

# Import Libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import PredefinedSplit

# Split data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=2020)

# Create a list where train data indices are -1 and validation data indices are 0
split_index = [-1 if x in X_train.index else 0 for x in X.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold=split_index)

# Use PredefinedSplit in GridSearchCV
clf = GridSearchCV(estimator=estimator,
                   cv=pds,
                   param_grid=param_grid)

# Fit with all of the data; PredefinedSplit routes the validation rows
# into the single test fold internally
clf.fit(X, y)

To add to the original answer: when the train/validation split is not done with scikit-learn's train_test_split() function, i.e. the dataframes were already split manually beforehand and scaled/normalized separately (so as to prevent leakage from the training data), the NumPy arrays can simply be concatenated.

import numpy as np
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

from sklearn.model_selection import PredefinedSplit, GridSearchCV
split_index = [-1]*len(X_train) + [0]*len(X_val)
X = np.concatenate((X_train, X_val), axis=0)
y = np.concatenate((y_train, y_val), axis=0)
pds = PredefinedSplit(test_fold=split_index)

clf = GridSearchCV(estimator=estimator,
                   cv=pds,
                   param_grid=param_grid)

# Fit with all data
clf.fit(X, y)
Answered By: arjunsiva

I wanted to provide some reproducible code that creates a validation split using the last 20% of observations.


from sklearn import datasets
from sklearn.model_selection import PredefinedSplit, GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

# Load the data once and reuse both the feature frame and the target
housing = datasets.fetch_california_housing(as_frame=True)
df_train = housing.data
y = housing.target

param_grid = {'max_depth': [5, 6],
              'learning_rate': [0.03, 0.06],
              'subsample': [0.5, 0.75]}

model = GradientBoostingRegressor()

# Create a single validation split
val_prop = 0.2
n_val_rows = round(len(df_train) * val_prop)
val_starting_index = len(df_train) - n_val_rows
cv = PredefinedSplit([-1 if i < val_starting_index else 0 for i in df_train.index])


# Use PredefinedSplit in GridSearchCV
results = GridSearchCV(estimator=model,
                       cv=cv,
                       param_grid=param_grid,
                       verbose=True,
                       n_jobs=-1)

# Fit with all data
results.fit(df_train, y)

results.best_params_
Answered By: KG in Chicago

The cv argument of the SearchCV classes (GridSearchCV or RandomizedSearchCV) can also simply be an iterable of (train, validation) index splits, i.e. cv=((train_idcs, val_idcs),).

Note that the data on which the search classifier is fit should be the combined train + validation set, and sklearn will use the specified indices to separate the two internally. Additionally, when working with dataframes, the specified indices must be usable as ilocs (positional indices), so reset the index first (without dropping it, if it will be required later); see the sketch after the example below.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    train_test_split,
    RandomizedSearchCV,
)

data = load_iris(as_frame=True)["frame"]

# These indices will serve as the explicit, predefined split
train_idcs, val_idcs = train_test_split(
    data.index, 
    random_state=42, 
    stratify=data.target,
)

param_grid = dict(
    n_estimators=[50,100,150,200],
    max_samples=[0.85,0.9,0.95,1],
    max_depth=[3,5,7,10],
    max_features=["sqrt", "log2", 0.85, 0.9, 0.95, 1],
)

search_clf = RandomizedSearchCV(
    estimator=RandomForestClassifier(),
    param_distributions=param_grid,
    n_iter=50,
    cv=((train_idcs, val_idcs),), # explicit predefined split in terms of indices
    random_state=42,
)

# X is the first 4 columns i.e. the sepal and petal widths and lengths
# and y is the 5th column i.e. target column
search_clf.fit(X=data.iloc[:,:4], y=data.target)
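
Regarding the iloc caveat above, a minimal sketch (df here stands for a hypothetical dataframe that was filtered earlier, so its labels no longer match positions):

# Reset so that labels match positions again; drop=False keeps the old
# labels in an "index" column in case they are needed later.
df = df.reset_index(drop=False)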

Also, be mindful of whether you want the final refit to happen on the whole data (train + validation) or only on the train split: with the default refit=True, the best estimator is refit on everything passed to fit(), so a model trained only on the train split has to be retrained manually with the best parameters, as sketched below.
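
For instance, a minimal sketch of that manual retrain, assuming the search above has already been fit:

from sklearn.base import clone

# With the default refit=True, search_clf.best_estimator_ was refit on
# train + validation. To get a model trained on the training split only,
# clone the base estimator and retrain it with the best parameters found.
best_clf = clone(search_clf.estimator).set_params(**search_clf.best_params_)
best_clf.fit(data.iloc[train_idcs, :4], data.target.iloc[train_idcs])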