Is there an easy way to grid search without cross validation in Python?
Question:
There is an extremely helpful class, GridSearchCV, in scikit-learn for doing grid search with cross validation, but I don’t want to do cross validation. I want to do grid search without cross validation and use the whole data set to train.
To be more specific, I need to evaluate my model, built with RandomForestClassifier, using the “oob score” during grid search.
Is there an easy way to do it, or should I write a class myself?
The points are:
- I’d like to do grid search in an easy way.
- I don’t want to do cross validation.
- I need to use the whole data set to train (I don’t want to split it into train data and test data).
- I need to use the oob score to evaluate during grid search.
Answers:
I would really advise against using OOB to evaluate a model, but it is useful to know how to run a grid search outside of GridSearchCV() (I frequently do this so I can save the CV predictions from the best grid for easy model stacking). I think the easiest way is to create your grid of parameters via ParameterGrid() and then just loop through every set of params. For example, assuming you have a grid dict named “grid” and an RF model object named “rf”, you can do something like this:
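    from sklearn.model_selection import ParameterGrid

    # rf must have been created with oob_score=True so that oob_score_ exists after fit
    best_score, best_grid = float("-inf"), None
    for g in ParameterGrid(grid):
        rf.set_params(**g)
        rf.fit(X, y)
        # save if best
        if rf.oob_score_ > best_score:
            best_score = rf.oob_score_
            best_grid = g

    print("OOB: %0.5f" % best_score)
    print("Grid:", best_grid)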
One method is to use ParameterGrid to make an iterator of the parameters you want and loop over it.
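As a small illustration of what that iterator yields, here is a sketch with a made-up grid (the values are hypothetical; the parameter names are ordinary RandomForestClassifier options). Each item is a plain dict that can be passed straight to set_params(**params):

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid, chosen only for illustration.
grid = {"n_estimators": [50, 100], "max_features": ["sqrt", "log2"]}

# ParameterGrid expands the dict into every combination of values.
combos = list(ParameterGrid(grid))
for params in combos:
    print(params)
```

With two values for each of two parameters, the loop body runs four times.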
Another thing you could do is actually configure the GridSearchCV to do what you want. I wouldn’t recommend this much because it’s unnecessarily complicated.
What you would need to do is:
- Use the cv arg (see the docs) and give it a generator which yields a tuple with all indices (so that train and test are the same).
- Change the scoring arg to use the oob score given out by the random forest.
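The two steps above could look roughly like the sketch below. It is only one way to wire this up, under some assumptions: the data is synthetic, the grid values are made up, and oob_scorer is a hypothetical helper name. It relies on scoring accepting a callable with the signature (estimator, X, y) and on cv accepting an iterable of (train, test) index arrays.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only so the sketch runs on its own.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

def oob_scorer(estimator, X, y):
    # Hypothetical helper: ignores X and y and returns the
    # out-of-bag score that the forest computed during fit.
    return estimator.oob_score_

# A single "split" in which train and test both cover every row.
all_idx = np.arange(len(X))
cv = [(all_idx, all_idx)]

search = GridSearchCV(
    RandomForestClassifier(oob_score=True, random_state=0),
    param_grid={"n_estimators": [50, 100], "max_features": ["sqrt", "log2"]},
    scoring=oob_scorer,
    cv=cv,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Because the scorer never looks at the "test" rows, the fact that train and test overlap does not leak into the score; the forest must be created with oob_score=True for this to work.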
See this link:
https://stackoverflow.com/a/44682305/2202107
He used cv=[(slice(None), slice(None))], which is NOT recommended by sklearn’s authors.
Although the question was solved years ago, I just found a more natural way, if you insist on using GridSearchCV() instead of other means (ParameterGrid(), etc.):
- Create a sklearn.model_selection.PredefinedSplit(). It takes a parameter called test_fold, which is a list of the same size as your input data. In the list, set every sample belonging to the training set to -1 and the others to 0.
- Create a GridSearchCV object with cv set to the created PredefinedSplit object.
Then GridSearchCV will generate only one train-validation split, the one defined by test_fold.
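A minimal sketch of that recipe, assuming synthetic data and an arbitrary choice of the last 50 rows as the validation set (both are illustrative, not part of the answer above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Synthetic data, used only so the sketch runs on its own.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# -1 = always in the training set, 0 = member of the single validation fold.
test_fold = np.full(len(X), -1)
test_fold[150:] = 0  # hypothetical choice: last 50 rows are held out
ps = PredefinedSplit(test_fold)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100]},  # made-up grid
    cv=ps,
)
search.fit(X, y)
print(ps.get_n_splits(), search.best_params_)
```

Note that this variant does hold out a validation set while searching, so candidates are not trained on the whole data; with the default refit=True, only the final best estimator is refit on all rows.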