sklearn GridSearchCV with Pipeline


I'm new to sklearn's Pipeline and GridSearchCV features. I am trying to build a pipeline that first runs RandomizedPCA on my training data and then fits a ridge regression model. Here is my code:

import numpy as np
from sklearn.decomposition import RandomizedPCA
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

pca = RandomizedPCA(1000, whiten=True)
rgn = Ridge()

pca_ridge = Pipeline([('pca', pca),
                      ('ridge', rgn)])

parameters = {'ridge__alpha': 10 ** np.linspace(-5, 2, 3)}

grid_search = GridSearchCV(pca_ridge, parameters, cv=2, n_jobs=1,
                           scoring='mean_squared_error')
grid_search.fit(train_x, train_y[:, 1:])

I know about the RidgeCV function but I want to try out Pipeline and GridSearchCV.

I want the grid search CV to report RMSE, but that doesn't seem to be supported in sklearn, so I'm making do with MSE. However, the scores it reports are negative:

In [41]: grid_search.grid_scores_
[mean: -0.02665, std: 0.00007, params: {'ridge__alpha': 1.0000000000000001e-05},
 mean: -0.02658, std: 0.00009, params: {'ridge__alpha': 0.031622776601683791},
 mean: -0.02626, std: 0.00008, params: {'ridge__alpha': 100.0}]

Obviously negative values aren't possible for mean squared error, so what am I doing wrong here?

Asked By: mchangun



Those scores are negative MSE scores, i.e. negate them and you get the MSE. By convention, GridSearchCV always tries to maximize its score, so loss functions like MSE have to be negated.
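In current scikit-learn the scorer is spelled 'neg_mean_squared_error'. A minimal sketch of the sign convention on synthetic data (variable names are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X @ rng.randn(5) + 0.1 * rng.randn(100)

# GridSearchCV maximizes its score, so the MSE is reported negated.
grid = GridSearchCV(Ridge(), {'alpha': [1e-5, 0.1, 100.0]},
                    cv=2, scoring='neg_mean_squared_error')
grid.fit(X, y)

neg_mse = grid.best_score_   # negative by convention
mse = -neg_mse               # negate to recover the ordinary MSE
```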

Answered By: Fred Foo

If you want RMSE as a metric, you can write your own callable that takes y_true and y_pred and computes the RMSE.
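For example, a hand-rolled RMSE callable wrapped with make_scorer (a sketch; the function name is arbitrary):

```python
import numpy as np
from sklearn.metrics import make_scorer, mean_squared_error

def rmse(y_true, y_pred):
    # Root of the ordinary mean squared error.
    return np.sqrt(mean_squared_error(y_true, y_pred))

# greater_is_better=False because a lower RMSE is better;
# GridSearchCV will therefore report the negated value.
rmse_scorer = make_scorer(rmse, greater_is_better=False)
```

You can then pass `scoring=rmse_scorer` to GridSearchCV.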


Answered By: mlengg

Suppose I have stored the negative MSE and negative MAE results from GridSearchCV in lists named model_nmse and model_nmae respectively.

I would then simply multiply them by -1 to get the desired MSE and MAE scores.

model_mse = list(np.multiply(model_nmse, -1))

model_mae = list(np.multiply(model_nmae, -1))
Answered By: Prateek sahu

An alternative way to create the GridSearchCV is to use make_scorer and set its greater_is_better flag to False.

So, if clf is your classifier and parameters is your hyperparameter grid, you can use make_scorer like this:

from sklearn.metrics import make_scorer, mean_squared_error

# define your own mse scorer and set greater_is_better=False
mse = make_scorer(mean_squared_error, greater_is_better=False)

Now you can call GridSearchCV as usual and pass your defined mse scorer:

grid_obj = GridSearchCV(clf, parameters, cv=5, scoring=mse, n_jobs=-1, verbose=True)
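Note that a scorer built this way still returns the negated value, so the sign flip described in the earlier answers applies here too. A quick self-contained check using a dummy estimator (illustrative data):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_error

mse = make_scorer(mean_squared_error, greater_is_better=False)

X = np.zeros((4, 1))
y = np.array([1.0, 1.0, 3.0, 3.0])
est = DummyRegressor(strategy='mean').fit(X, y)  # always predicts 2.0

# Each prediction is off by 1, so the true MSE is 1.0;
# the scorer returns it negated.
print(mse(est, X, y))  # -1.0
```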
Answered By: Espanta

You can see the valid scoring values in the documentation.


Answered By: chaoyu feng