XGBoost: The least populated class in y has only 1 members, which is too few

Question:

I'm using the scikit-learn wrapper for XGBoost in a Kaggle competition.
However, I'm getting this warning message:

$ python Script1.py
/home/sky/private/virtualenv15.0.1dev/myVE/local/lib/python2.7/site-packages/sklearn/cross_validation.py:516: Warning: The least populated class in y has only 1 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=3.
  % (min_labels, self.n_folds)), Warning)

According to another question on Stack Overflow:

"Check that you have at least 3 samples per class to be able to do StratifiedKFold cross validation with k == 3 (I think this is the default CV used by GridSearchCV for classification)."

And, well, I don't have at least 3 samples per class.

So my questions are:

  1. What are the alternatives?

  2. Why can't I use cross-validation?

  3. What can I use instead?

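# Imports were omitted in the original post; these module paths are an assumption
# (GridSearchCV lives in sklearn.model_selection from scikit-learn 0.18 onward).
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Grid of tree-complexity parameters to search over
param_test1 = {
    'max_depth': range(3, 10, 2),
    'min_child_weight': range(1, 6, 2)
}

grid_search = GridSearchCV(
    estimator=XGBClassifier(
        learning_rate=0.1,
        n_estimators=3000,
        max_depth=15,
        min_child_weight=1,
        gamma=0,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='multi:softmax',
        nthread=42,
        scale_pos_weight=1,
        seed=27),
    param_grid=param_test1,
    scoring='roc_auc',
    n_jobs=42,
    iid=False,  # only accepted by older scikit-learn releases
    cv=None,    # default CV; the warning above reports n_folds=3
    verbose=1)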
...

grid_search.fit(train_x, place_id)

References:

One-shot learning with scikit-learn

Using a support vector classifier with polynomial kernel in scikit-learn

Asked By: KenobiBastila

Answers:

If a target class has only one sample, that is too few for any model to learn from. If you can, get another dataset, preferably one that is as balanced as possible, since most models behave better on balanced data.
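To see which classes are too small, a quick count of samples per class can help. The following is a minimal sketch, not part of the original answer: the helper name `classes_below_threshold` and the toy labels are made up for illustration, and the threshold of 3 is taken from the `n_folds=3` reported in the warning.

```python
from collections import Counter


def classes_below_threshold(labels, n_folds=3):
    """Return {class: count} for every class with fewer than n_folds samples.

    Hypothetical helper: stratified k-fold needs at least n_folds samples
    of each class so that every fold can contain one of them.
    """
    counts = Counter(labels)
    return {cls: n for cls, n in counts.items() if n < n_folds}


# Toy labels (invented for this example): class "c" appears only once.
toy_labels = ["a", "a", "a", "b", "b", "b", "b", "c"]
rare = classes_below_threshold(toy_labels, n_folds=3)
print(rare)  # {'c': 1}
```

In the question's setting, `labels` would be the target passed to `fit` (there called `place_id`); any class listed in the result would trigger the warning.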

If you cannot get another dataset, you will have to work with what you have. I would suggest removing the sample that belongs to the lone class. You will then have a model that does not cover that class. If that does not fit your requirements, you need a new dataset.
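Removing the under-populated classes could look roughly like this. It is only a sketch under stated assumptions: `drop_rare_classes` is a hypothetical helper, the rows and labels are toy data, and plain Python lists are used so the example runs without extra libraries. With NumPy arrays or pandas objects, the same idea would typically be written as a boolean mask.

```python
from collections import Counter


def drop_rare_classes(rows, labels, min_count=3):
    """Keep only samples whose class has at least min_count members.

    Hypothetical helper; min_count should match the number of CV folds.
    """
    counts = Counter(labels)
    kept = [(row, label) for row, label in zip(rows, labels)
            if counts[label] >= min_count]
    kept_rows = [row for row, _ in kept]
    kept_labels = [label for _, label in kept]
    return kept_rows, kept_labels


# Toy data (invented): class 2 has a single sample and will be dropped.
toy_rows = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [14, 15]]
toy_labels = [0, 0, 0, 1, 1, 1, 1, 2]

rows_kept, labels_kept = drop_rare_classes(toy_rows, toy_labels, min_count=3)
print(len(rows_kept), sorted(set(labels_kept)))  # 7 [0, 1]
```

The filtered rows and labels can then be passed to the grid search in place of the originals. As the answer notes, the resulting model will never predict the removed classes.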

Answered By: Rabbit