Why doesn't GridSearchCV give the best score? – Scikit Learn
Question:
I have a dataset with 158 rows and 10 columns. I am trying to build a multiple linear regression model and predict future values.
I used GridSearchCV for tuning the parameters.
Here is my GridSearchCV and regression function:
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV, train_test_split

def GridSearch(data):
    X_train, X_test, y_train, y_test = train_test_split(
        data, ground_truth_data, test_size=0.3, random_state=0)
    # note: 'normalize' was removed from LinearRegression in scikit-learn 1.2;
    # drop it from this grid on recent versions
    parameters = {'fit_intercept': [True, False], 'normalize': [True, False],
                  'copy_X': [True, False]}
    model = linear_model.LinearRegression()
    grid = GridSearchCV(model, parameters)
    grid.fit(X_train, y_train)
    predictions = grid.predict(X_test)
    print("Grid best score: ", grid.best_score_)
    print("Grid score function: ", grid.score(X_test, y_test))
The output of this code is:
Grid best score: 0.720298870251
Grid score function: 0.888263112299
My question is: what is the difference between best_score_ and the score function? How can the score function be better than best_score_?
Thanks in advance.
Answers:
The best_score_ is the best score from the cross-validation. That is, the model is fit on part of the training data, and the score is computed by predicting the rest of the training data. This is because you passed X_train and y_train to fit; the fit process thus does not know anything about your test set, only your training set.
The score method of the model object scores the model on the data you give it. You passed X_test and y_test, so this call computes the score of the fit (i.e., tuned) model on the test set.
In short, the two scores are calculated on different data sets, so it shouldn’t be surprising that they are different.
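A minimal sketch of the distinction, using synthetic data of the same shape as the question's (158 rows, 10 columns) and the modern scikit-learn API. Since 'normalize' was removed from LinearRegression in recent scikit-learn versions, only fit_intercept is tuned here; the data and parameter grid are illustrative, not the asker's.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data with the same shape as in the question.
rng = np.random.RandomState(0)
X = rng.rand(158, 10)
y = X @ rng.rand(10) + 0.1 * rng.randn(158)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

grid = GridSearchCV(LinearRegression(),
                    {'fit_intercept': [True, False]}, cv=5)
grid.fit(X_train, y_train)

cv_score = grid.best_score_              # mean R^2 over CV validation folds
test_score = grid.score(X_test, y_test)  # R^2 of the refit model on the test set
print(cv_score, test_score)              # generally two different numbers
```

Because best_score_ averages over validation folds carved out of the training set while score evaluates the refit model on the held-out test set, the two values routinely differ, and either one can be higher.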
If I understand your question correctly, you want to check the performance of your model, right? Is your dataset linear or nonlinear? If it is nonlinear, then you cannot rely on R squared alone; a residual plot can be of great help.
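One way to sketch such a residual plot, again on synthetic data standing in for the asker's: plot residuals against predicted values, and look for structure. A random cloud around zero is consistent with a linear fit; a curve or funnel shape suggests nonlinearity or non-constant variance that R squared alone would hide.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the question's 158 x 10 dataset.
rng = np.random.RandomState(0)
X = rng.rand(158, 10)
y = X @ rng.rand(10) + 0.1 * rng.randn(158)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

plt.scatter(model.predict(X), residuals, s=10)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.savefig("residuals.png")
```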