How to apply predict to xgboost cross validation
Question:
After some time searching Google, I feel this might be a nonsensical question, but here it goes. With the following code I can build an xgb regression model, fit it on the training set, and evaluate it:
import time
import xgboost as xgb
from sklearn.metrics import log_loss
xgb_reg = xgb.XGBRegressor(objective='binary:logistic',
                           gamma=.12,
                           eval_metric='logloss',
                           # eval_metric='auc',
                           eta=.068,
                           subsample=.78,
                           colsample_bytree=.76,
                           min_child_weight=9,
                           max_delta_step=5,
                           nthread=4)
start = time.time()
xgb_reg.fit(X_train, y_train)
print(time.time() - start)  # elapsed fitting time in seconds
y_pred = xgb_reg.predict(X_test)
print(log_loss(y_test, y_pred))
Now I want to take that a step further and use k-fold CV to improve the model, so I have this:
data_dmatrix = xgb.DMatrix(data=X_train, label=y_train)
params = {'objective': 'binary:logistic', 'eval_metric': 'logloss', 'eta': .068,
          'subsample': .78, 'colsample_bytree': .76, 'min_child_weight': 9,
          'max_delta_step': 5, 'nthread': 4}
xgb_cv = xgb.cv(dtrain=data_dmatrix, params=params, nfold=5,
                num_boost_round=20, metrics='logloss', seed=42)
However, this spits out a data frame and I can’t use the .predict() on the test set.
I’m thinking I might not be understanding the fundamental concept of this but I’m hoping I’m just overlooking something simple.
Answers:
k-fold CV doesn’t make the model more accurate per se. In your example with xgb, there are many hyperparameters (e.g. subsample, eta) to be specified. To get a sense of how the chosen parameters perform on unseen data, we use k-fold CV to partition the data into several training and test samples and measure out-of-sample accuracy.
We usually repeat this for several candidate values of a parameter and keep the values that give the lowest average error. After that, you refit your model with those parameters. This post and its answers discuss it.
For example, below we run something like what you did, and we get the train / test error for only one set of values:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=500, class_sep=0.7)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33, random_state=42)
data_dmatrix = xgb.DMatrix(data=X_train, label=y_train)
params = {'objective': 'binary:logistic', 'eval_metric': 'logloss',
          'eta': 0.01,
          'subsample': 0.1}
xgb_cv = xgb.cv(dtrain=data_dmatrix, params=params, nfold=5, metrics='logloss', seed=42)
   train-logloss-mean  train-logloss-std  test-logloss-mean  test-logloss-std
0            0.689600           0.000517           0.689820          0.001009
1            0.686462           0.001612           0.687151          0.002089
2            0.683626           0.001438           0.684667          0.003009
3            0.680450           0.001100           0.681929          0.003604
4            0.678269           0.001399           0.680310          0.002781
5            0.675170           0.001867           0.677254          0.003086
6            0.672349           0.002483           0.674432          0.004349
7            0.668964           0.002484           0.671493          0.004579
8            0.666361           0.002831           0.668978          0.004200
9            0.663682           0.003881           0.666744          0.003598
The last row is the result from the last boosting round, which is what we use for evaluation.
If we test over multiple values of eta and subsample, for example:
grid = pd.DataFrame({'eta': [0.01, 0.05, 0.1] * 2,
                     'subsample': np.repeat([0.1, 0.3], 3)})
    eta  subsample
0  0.01        0.1
1  0.05        0.1
2  0.10        0.1
3  0.01        0.3
4  0.05        0.3
5  0.10        0.3
Normally we can use GridSearchCV for this, but below is something that uses xgb.cv:
def fit(x):
    # x is one row of the grid; look the values up by column name
    params = {'objective': 'binary:logistic',
              'eval_metric': 'logloss',
              'eta': x['eta'],
              'subsample': x['subsample']}
    xgb_cv = xgb.cv(dtrain=data_dmatrix, params=params,
                    nfold=5, metrics='logloss', seed=42)
    # keep only the scores from the last boosting round
    return xgb_cv.iloc[-1].values
grid[['train-logloss-mean', 'train-logloss-std',
      'test-logloss-mean', 'test-logloss-std']] = grid.apply(fit, axis=1, result_type='expand')
    eta  subsample  train-logloss-mean  train-logloss-std  test-logloss-mean  test-logloss-std
0  0.01        0.1            0.663682           0.003881           0.666744          0.003598
1  0.05        0.1            0.570629           0.012555           0.580309          0.023561
2  0.10        0.1            0.503440           0.017761           0.526891          0.031659
3  0.01        0.3            0.646587           0.002063           0.653741          0.004201
4  0.05        0.3            0.512229           0.008013           0.545113          0.018700
5  0.10        0.3            0.414103           0.012427           0.472379          0.032606
We can see that eta = 0.10 and subsample = 0.3 give the best result (the lowest test-logloss-mean), so next you just need to refit the model with these parameters:
xgb_reg = xgb.XGBRegressor(objective='binary:logistic',
                           eval_metric='logloss',
                           eta=0.1,
                           subsample=0.3)
xgb_reg.fit(X_train, y_train)
You can get this info from: https://xgboost.readthedocs.io/en/stable/parameter.html
[default=reg:squarederror]
The objective for a regressor should be the one listed there, not ‘binary:logistic’ as was originally posted and carried forward into the answer. I only wanted to point that out, as @innovator-programmer did.