How to apply predict to xgboost cross validation

Question:

After some time searching Google, I feel this might be a nonsensical question, but here it goes. If I use the following code, I can produce an xgb regression model, which I can then fit on the training set and evaluate:

import time
import xgboost as xgb
from sklearn.metrics import log_loss

xgb_reg = xgb.XGBRegressor(objective='binary:logistic',
                           gamma = .12,
                           eval_metric = 'logloss',
                           #eval_metric = 'auc', 
                           eta = .068,
                           subsample = .78,
                           colsample_bytree = .76,
                           min_child_weight = 9,
                           max_delta_step = 5,
                           nthread = 4)

start = time.time()
xgb_reg.fit(X_train, y_train)
print(time.time() - start)  # elapsed fit time in seconds

y_pred = xgb_reg.predict(X_test)
print(log_loss(y_test, y_pred))

Now, I want to take that a step further and use k-fold CV to improve the model, so I have this:

data_dmatrix = xgb.DMatrix(data=X_train,label=y_train)
params = {'objective':'binary:logistic','eval_metric':'logloss','eta':.068,
          'subsample':.78,'colsample_bytree':.76,'min_child_weight':9,
          'max_delta_step':5,'nthread':4}
xgb_cv = xgb.cv(dtrain=data_dmatrix, params=params, nfold=5, num_boost_round=20, metrics='logloss', seed=42)

However, this spits out a DataFrame, and I can’t call .predict() on the test set.

I’m thinking I might not be understanding the fundamental concept here, but I’m hoping I’m just overlooking something simple.

Asked By: jon


Answers:

k-fold CV doesn’t make the model more accurate per se. In your example with xgb, there are many hyperparameters (e.g. subsample, eta) to be specified, and to get a sense of how the chosen parameters perform on unseen data, we use k-fold CV to partition the data into several training and validation samples and measure the out-of-sample accuracy.

We usually try this for several possible values of a parameter and pick whichever gives the lowest average error. After this, you would refit your model with those parameters. This post and its answers discuss it.
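To make that concrete, here is a minimal sketch of what k-fold CV measures, done by hand with sklearn's KFold (my illustration, not part of the original code; it assumes the question's X_train/y_train are NumPy arrays):

import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold
from sklearn.metrics import log_loss

# Out-of-fold log loss for one fixed set of hyperparameters
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in kf.split(X_train):
    model = xgb.XGBClassifier(learning_rate=0.068, subsample=0.78)
    model.fit(X_train[train_idx], y_train[train_idx])
    proba = model.predict_proba(X_train[val_idx])[:, 1]
    scores.append(log_loss(y_train[val_idx], proba))
print(np.mean(scores))  # average out-of-sample error for this parameter set

Each fold's model is thrown away; only the error estimate is kept, which is why cross-validation by itself never hands you a model to call .predict() with.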

For example, below we run something like what you did, and we get only the train/test error for one set of values:

import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, class_sep=0.7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=42)

data_dmatrix = xgb.DMatrix(data=X_train,label=y_train)
params = {'objective':'binary:logistic','eval_metric':'logloss',
          'eta':0.01,
          'subsample':0.1}
xgb_cv = xgb.cv(dtrain=data_dmatrix, params=params, nfold=5, metrics='logloss', seed=42)

   train-logloss-mean  train-logloss-std  test-logloss-mean  test-logloss-std
0            0.689600           0.000517           0.689820          0.001009
1            0.686462           0.001612           0.687151          0.002089
2            0.683626           0.001438           0.684667          0.003009
3            0.680450           0.001100           0.681929          0.003604
4            0.678269           0.001399           0.680310          0.002781
5            0.675170           0.001867           0.677254          0.003086
6            0.672349           0.002483           0.674432          0.004349
7            0.668964           0.002484           0.671493          0.004579
8            0.666361           0.002831           0.668978          0.004200
9            0.663682           0.003881           0.666744          0.003598

The last row is the result from the last boosting round, which is what we use for evaluation.
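Since xgb.cv returns a pandas DataFrame with one row per boosting round, pulling that number out is just an indexing operation; for example:

# Mean test log-loss at the final boosting round
print(xgb_cv['test-logloss-mean'].iloc[-1])  # 0.666744 in the run above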

If we test over multiple values of eta (and subsample, for example):

grid = pd.DataFrame({'eta':[0.01,0.05,0.1]*2,
                     'subsample':np.repeat([0.1,0.3],3)})

    eta  subsample
0  0.01        0.1
1  0.05        0.1
2  0.10        0.1
3  0.01        0.3
4  0.05        0.3
5  0.10        0.3

Normally we can use GridSearchCV for this; a sketch of that route is below, followed by something that uses xgb.cv instead.
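Roughly, and with some assumptions on my part (the sklearn wrapper xgb.XGBClassifier since the target is binary, learning_rate as the sklearn alias for eta, and n_estimators=10 to mirror xgb.cv's default of 10 boosting rounds), that sketch could be:

from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# Grid search over the same six (eta, subsample) combinations,
# scored by cross-validated log loss
param_grid = {'learning_rate': [0.01, 0.05, 0.1],
              'subsample': [0.1, 0.3]}
search = GridSearchCV(xgb.XGBClassifier(n_estimators=10),
                      param_grid, scoring='neg_log_loss', cv=5)
search.fit(X_train, y_train)
print(search.best_params_)

And here is the version built on xgb.cv: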

def fit(x):
    # Run 5-fold CV for one row of the grid and return the metrics
    # of the last boosting round as a flat array
    params = {'objective':'binary:logistic',
              'eval_metric':'logloss',
              'eta':x['eta'],
              'subsample':x['subsample']}
    xgb_cv = xgb.cv(dtrain=data_dmatrix, params=params,
                    nfold=5, metrics='logloss', seed=42)
    return xgb_cv[-1:].values[0]

grid[['train-logloss-mean','train-logloss-std',
      'test-logloss-mean','test-logloss-std']] = grid.apply(fit, axis=1,
                                                            result_type='expand')

    eta  subsample  train-logloss-mean  train-logloss-std  test-logloss-mean  test-logloss-std
0  0.01        0.1            0.663682           0.003881           0.666744          0.003598
1  0.05        0.1            0.570629           0.012555           0.580309          0.023561
2  0.10        0.1            0.503440           0.017761           0.526891          0.031659
3  0.01        0.3            0.646587           0.002063           0.653741          0.004201
4  0.05        0.3            0.512229           0.008013           0.545113          0.018700
5  0.10        0.3            0.414103           0.012427           0.472379          0.032606

We can see that eta = 0.10 and subsample = 0.3 gives the best result (lowest mean test log-loss), so next you just need to refit the model with these parameters:

xgb_reg = xgb.XGBRegressor(objective='binary:logistic',
                           eval_metric = 'logloss',
                           eta = 0.1,
                           subsample = 0.3)

xgb_reg.fit(X_train, y_train)
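Prediction then happens on this refitted model, not on the DataFrame returned by xgb.cv, which never carries a fitted booster. Reusing the test split from above:

from sklearn.metrics import log_loss

# Evaluate the refitted model on the held-out test set
y_pred = xgb_reg.predict(X_test)
print(log_loss(y_test, y_pred))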
Answered By: StupidWolf

You can get this info from: https://xgboost.readthedocs.io/en/stable/parameter.html

In the parameter docs, objective is listed with [default=reg:squarederror].

The objective function for a regressor should be a regression objective such as that default, not ‘binary:logistic’ as was originally posted and carried forward. My statement is just to point that out, as @innovator-programmer did.
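A minimal sketch of the distinction:

import xgboost as xgb

# Binary target: use the classifier with a classification objective
clf = xgb.XGBClassifier(objective='binary:logistic', eval_metric='logloss')

# Continuous target: use the regressor; its default objective
# is reg:squarederror
reg = xgb.XGBRegressor(objective='reg:squarederror')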

Answered By: Srinivas