NotFittedError (instance is not fitted yet) after invoking cross_validate

Question:

This is my minimal reproducible example:

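import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_validate

x = np.array([
   [1, 2],
   [3, 4],
   [5, 6],
   [6, 7]
])
y = [1, 0, 0, 1]

model = GaussianNB()
scores = cross_validate(model, x, y, cv=2, scoring="accuracy")

# predict expects a 2D array of samples; this call raises NotFittedError
model.predict([[8, 9]])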

What I intended to do was instantiate a Gaussian Naive Bayes classifier and use sklearn.model_selection.cross_validate to cross-validate my model (I am using cross_validate instead of cross_val_score because in my real project I need precision, recall and f1 as well).

I have read in the docs that cross_validate will "evaluate metric(s) by cross-validation and also record fit/score times."

I expected my model to have been fitted on the x (features) and y (labels) data, but when I invoke model.predict(.) I get:

sklearn.exceptions.NotFittedError: This GaussianNB instance is not fitted yet. Call ‘fit’ with appropriate arguments before using this estimator.

Of course, it tells me to invoke model.fit(x, y) before "using the estimator" (that is, before invoking model.predict(.)).

Shouldn’t the model have been fitted cv=2 times when I invoke cross_validate(...)?

Asked By: tail

Answers:

A close look at the cross_validate documentation reveals that it includes an argument:

return_estimator : bool, default=False

Whether to return the estimators fitted on each split.

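So, by default, it does not return any fitted estimator. Note also that cross_validate fits clones of the estimator you pass in, one per split, so your original model object is never fitted; this is why model.predict(...) raises NotFittedError.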
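As a minimal sketch of what happens to the estimator you pass in, the snippet below reuses the data from the question and checks the state of the original object after cross_validate has run. The helper names original_is_fitted, fold_models and clones_are_distinct are only illustrative; check_is_fitted is scikit-learn's own validation utility.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.utils.validation import check_is_fitted

x = np.array([[1, 2], [3, 4], [5, 6], [6, 7]])
y = [1, 0, 0, 1]

model = GaussianNB()
cv_results = cross_validate(
    model, x, y, cv=2, scoring="accuracy", return_estimator=True
)

# The object passed in is left untouched: the fitting happens on clones.
try:
    check_is_fitted(model)
    original_is_fitted = True
except NotFittedError:
    original_is_fitted = False

# One fitted estimator per split, none of them the original object.
fold_models = cv_results["estimator"]
clones_are_distinct = all(est is not model for est in fold_models)

print(original_is_fitted, len(fold_models), clones_are_distinct)
```

The original model stays unfitted, while the list under the 'estimator' key holds one separately fitted object per fold.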

In order to predict with the fitted estimator(s), you need to set return_estimator=True; but beware, you will not get a single fitted model, but a number of models equal to your cv parameter value (here 2):

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_validate

x = np.array([
   [1, 2],
   [3, 4],
   [5, 6],
   [6, 7]
])  
y = [1, 0, 0, 1]

model = GaussianNB()
scores = cross_validate(model, x, y, cv=2, scoring="accuracy", return_estimator=True)
scores
# result:
# {'fit_time': array([0.00124454, 0.00095725]),
#  'score_time': array([0.00090432, 0.00054836]),
#  'estimator': [GaussianNB(), GaussianNB()],
#  'test_score': array([0.5, 0.5])}

So, in order to get predictions from each fitted model, you need:

scores['estimator'][0].predict([[8,9]])
# array([1])

scores['estimator'][1].predict([[8,9]])
# array([0])

This may look inconvenient, but it is like that by design: cross_validate is generally meant only to return the scores necessary for diagnosis and assessment, not to be used for fitting models which are to be used for predictions.
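If the goal is both assessment and a model to predict with, one common workflow (a sketch, not the only option) is to use cross_validate purely for the scores and then fit a fresh estimator on all of the data. The example below assumes the same toy data as above and requests the extra metrics mentioned in the question; final_model and metric_keys are illustrative names. On data this small some metrics can be undefined in a fold, so the corresponding scikit-learn warning is silenced here.

```python
import warnings

import numpy as np
from sklearn.exceptions import UndefinedMetricWarning
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

x = np.array([[1, 2], [3, 4], [5, 6], [6, 7]])
y = np.array([1, 0, 0, 1])

# Step 1: assessment only. With several metrics, the result dict
# contains one 'test_<metric>' entry per requested scorer.
with warnings.catch_warnings():
    # Tiny folds can leave precision undefined; ignore that warning.
    warnings.simplefilter("ignore", category=UndefinedMetricWarning)
    cv_results = cross_validate(
        GaussianNB(), x, y, cv=2,
        scoring=("accuracy", "precision", "recall", "f1"),
    )
metric_keys = sorted(k for k in cv_results if k.startswith("test_"))
print(metric_keys)

# Step 2: fit a separate model on all the data and predict with it.
final_model = GaussianNB().fit(x, y)
prediction = final_model.predict([[8, 9]])
print(prediction)
```

The cross-validation scores then describe how this kind of model is expected to perform, while final_model is the single object used for predictions.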

Answered By: desertnaut