Why can sklearn's KFold splits only be enumerated once (and how do I use them in xgboost.cv)?


I am trying to create a KFold object for my xgboost.cv call, and I have:

import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame([[1,2,3,4,5],[6,7,8,9,10]])

KF = KFold(n_splits=2)
kf = KF.split(df)

But it seems I can only enumerate once:

for i, (train_index, test_index) in enumerate(kf):
    print(f"Fold {i}")

for i, (train_index, test_index) in enumerate(kf):
    print(f"Again_Fold {i}")

gives output of

Fold 0
Fold 1

The second enumerate seems to be on an empty object.

I am probably misunderstanding something fundamental, or have completely messed up somewhere, but could someone explain this behavior?

[Edit, adding a follow-up question] This behavior seems to cause an index-out-of-range error when I pass the splits to xgboost.cv with xgboost.cv(..., folds = KF.split(df)). My fix is to recreate the list of tuples with

kf = []
for train_index, test_index in KF.split(df):
    this_split = (list(train_index), list(test_index))
    kf.append(this_split)  # collect each fold; without this, kf stays empty

xgboost.cv(..., folds = kf)

I am looking for smarter solutions.

Asked By: Yue Y



Using an example:

from sklearn.model_selection import KFold
import xgboost as xgb
import numpy as np

data = np.random.rand(5, 10)  # 5 entities, each contains 10 features
label = np.random.randint(2, size=5)  # binary target
dtrain = xgb.DMatrix(data, label=label)

param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}

If we run your code (with xgboost and df defined as in your question):

KF = KFold(n_splits=2)
xgboost.cv(params= param,dtrain=dtrain, folds = KF.split(df))

I get the error:

IndexError                                Traceback (most recent call last)
Cell In[51], line 2
      1 KF = KFold(n_splits=2)
----> 2 xgboost.cv(params= param,dtrain=dtrain, folds = KF.split(df))

IndexError: list index out of range
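
A plausible explanation for this error, based on my reading of xgboost's fold-handling code (treat the details as an assumption, they may differ between versions): the folds argument is iterated over twice, once to collect the training indices and once to collect the test indices. KF.split(...) returns a generator, so the second pass finds nothing left and a later lookup of a test fold fails. The sketch below only illustrates that mechanism; collect_folds and fake_split are hypothetical stand-ins, not xgboost or sklearn code.

```python
def collect_folds(folds):
    """Hypothetical stand-in for two-pass fold handling (not xgboost's real code)."""
    train_sets = [pair[0] for pair in folds]  # first pass over folds
    test_sets = [pair[1] for pair in folds]   # second pass: empty if folds is a generator
    return train_sets, test_sets


def fake_split():
    """Yields (train, test) index pairs lazily, the way KFold.split does."""
    yield [1], [0]
    yield [0], [1]


# Generator: consumed by the first pass, so no test sets are collected.
train_sets, test_sets = collect_folds(fake_split())
print(len(train_sets), len(test_sets))

# List: can be traversed any number of times.
train_list, test_list = collect_folds(list(fake_split()))
print(len(train_list), len(test_list))

# Looking up the first test fold from the generator case fails.
try:
    test_sets[0]
except IndexError as exc:
    error_message = str(exc)
    print("IndexError:", error_message)
```

With the generator, the test-set list ends up empty, which matches the "list index out of range" message above.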

The documentation asks for a KFold (or StratifiedKFold) instance, or alternatively a list of fold indices, so you just need to pass the KFold object itself:

KF = KFold(n_splits=2)
xgb.cv(params=param, dtrain=dtrain, folds=KF)

You can check the source code and see that xgboost.cv calls the split method itself, so you don't need to pass KF.split(..).
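
On the first part of the question: KF.split(df) returns a generator, and a Python generator can only be consumed once, which is why the second loop prints nothing. The KFold object itself is reusable, so you can either call split again or materialise the folds with list(...). A minimal sketch, where X is a NumPy array standing in for the question's df and the names first_pass, second_pass and folds are my own:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(2, 5)  # two samples, standing in for the question's df
KF = KFold(n_splits=2)

# Option 1: ask the KFold object for a fresh generator on every pass.
first_pass = [i for i, _ in enumerate(KF.split(X))]
second_pass = [i for i, _ in enumerate(KF.split(X))]

# Option 2: materialise the folds once; a list can be looped over repeatedly.
folds = list(KF.split(X))
for i, (train_index, test_index) in enumerate(folds):
    print(f"Fold {i}: train={train_index}, test={test_index}")
for i, (train_index, test_index) in enumerate(folds):
    print(f"Again_Fold {i}: train={train_index}, test={test_index}")
```

Since the xgboost.cv documentation also allows a list of fold indices for folds, a list built this way should be usable there too, which would replace the manual loop in the question.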

Answered By: StupidWolf