GridSearchCV & RandomizedSearchCV – do you refit the model after running

Question:

I have some test and train data; the test data does not have any dependent variables.

I’m currently running a GridSearchCV or RandomizedSearchCV to find the best parameters.

Should I pass all of my “test” X & y values into a GridSearchCV or RandomizedSearchCV?

I understand it does a cross validation, so I believe it’s fine to?

But if this is the case, what data has the best_estimator been fit with? All of it? Or data from one of the folds?

Do I need to refit the full set of test data after?

Asked By: Lewis Morris

Answers:

There are several questions here, so I will try to answer them one by one.

Should I pass all of my "test" X & y values into a GridSearchCV or RandomizedSearchCV?

You mentioned that you don’t have the dependent variable for your test data. In that case you cannot pass it to the search at all, because fitting requires y. Even if you did have the labels, you should not pass the test data to GridSearchCV or RandomizedSearchCV. These methods internally split the data you give them into training and validation folds, and each hyperparameter setting is scored on the validation folds.

what data has the best_estimator been fit with?

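It depends on how you initialized your GridSearchCV or RandomizedSearchCV object. Both have a parameter called refit which, when set to True (the default), refits the model on the entire dataset passed to fit, using the best hyperparameters found. That refitted model is what best_estimator_ holds, so it was not fit on just one of the folds.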
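As a minimal sketch of the refit behaviour described above: the synthetic data, the LogisticRegression estimator, and the values in the C grid are assumptions chosen purely for illustration, not anything from the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the labelled training data (illustrative assumption).
X_train, y_train = make_classification(n_samples=200, n_features=10, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # hypothetical grid
    cv=5,
    refit=True,  # the default, written out here for clarity
)
search.fit(X_train, y_train)

# With refit=True, best_estimator_ has been refit on all of X_train, y_train
# using best_params_, so no manual refit is needed afterwards.
best_model = search.best_estimator_

# Calling predict on the search object delegates to best_estimator_.
same_predictions = (search.predict(X_train) == best_model.predict(X_train)).all()
print(search.best_params_, same_predictions)
```

If refit were set to False, best_estimator_ would not be available and you would have to fit a fresh model with search.best_params_ yourself.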

Do I need to refit the full set of test data after?

Generally, you don’t use your test data to tune your hyperparameters. You tune them on the validation set, and once you have frozen your model, you use the test set to check its performance, which gives an unbiased estimate of how well the model performs.
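The workflow above could look roughly like the following sketch. It assumes you carve a held-out test split out of your labelled data, since the test data in the question has no labels to score against. The synthetic data, the RandomForestClassifier, and the parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the labelled data (illustrative assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hold out a test split that the search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={  # hypothetical search space
        "n_estimators": [25, 50, 100],
        "max_depth": [2, 4, 8, None],
    },
    n_iter=5,
    cv=5,
    random_state=0,
)
# Cross-validation happens inside the training split only.
search.fit(X_train, y_train)

cv_score = search.best_score_  # mean validation score of the best setting
test_score = search.score(X_test, y_test)  # scored once, on untouched data
print(cv_score, test_score)
```

Because the test split played no part in choosing the hyperparameters, test_score is the number to report as the estimate of performance on unseen data.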

Answered By: Parthasarathy Subburaj

Nothing can stop you from using your test dataset to find optimal hyperparameters for your model. However, after doing this you can’t really tell how well your model generalizes, i.e. how it behaves on unseen data, because you used the test set to tune the model, which makes it useless for measuring the model’s performance.

Also I believe Cross Validated would be a better place to ask such questions.

Answered By: Tomasz Bartkowiak