Apply a cross-validated ML model to unseen data

Question:

I would like to use scikit-learn to predict a variable y from X. I would like to train a classifier on a training dataset using cross-validation and then apply this classifier to an unseen test dataset (as in https://www.nature.com/articles/s41586-022-04492-9).

from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Import dataset
X, y = datasets.load_iris(return_X_y=True)

# Create binary variable y (merge class 0 into class 1)
y[y == 0] = 1

# Divide into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=75, random_state=4, stratify=y)

# Define the classifier
model = SVC()

# Cross-validation on the training data
cv_model = cross_validate(model, x_train, y_train, cv=5)

Now I would like to apply this cross-validated model to the unseen test set, but I cannot find out how to do it.

It would be something like:

result = cv_model.score(x_test, y_test)

Except this does not work.

Asked By: salim


Answers:

You cannot do that; you need to fit the model before using it to predict on new data. cross_validate is just a convenience function for getting the scores; as clearly stated in the documentation, it returns just that, i.e. scores, and not a (fitted) model:

Evaluate metric(s) by cross-validation and also record fit/score times.

[…]

Returns: scores : dict of float arrays of shape (n_splits,)

Array of scores of the estimator for each run of the cross validation.

A dict of arrays containing the score/time arrays for each scorer is returned.
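One way to act on this, as a minimal sketch rather than a definitive recipe: use cross_validate only to estimate performance on the training data, then fit the estimator yourself on the whole training set and score it on the held-out test set. The data split mirrors the question; the choice of a default SVC() as the model is an assumption, since the question imports SVC but never defines model.

```python
from sklearn import datasets
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.svm import SVC

# Same data preparation as in the question
X, y = datasets.load_iris(return_X_y=True)
y[y == 0] = 1  # merge class 0 into class 1 to get a binary target

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=75, random_state=4, stratify=y
)

# Assumption: a default SVC, since the question does not define `model`
model = SVC()

# Step 1: cross-validation on the training data gives scores only.
# cross_validate fits clones of `model`, so `model` itself stays unfitted.
cv_results = cross_validate(model, x_train, y_train, cv=5)
cv_mean = cv_results["test_score"].mean()

# Step 2: fit on the full training set, then evaluate on the unseen test set.
model.fit(x_train, y_train)
test_accuracy = model.score(x_test, y_test)

print("Mean CV accuracy:", cv_mean)
print("Test accuracy:", test_accuracy)
```

If the per-fold fitted estimators are needed, cross_validate also accepts return_estimator=True, which adds an "estimator" entry to the returned dict; each of those models was trained on only part of the training data, so refitting on the full training set as above is usually the simpler choice.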

Answered By: desertnaut