xgboost: Sample Weights for Imbalanced Data?

Question:

I have a highly imbalanced dataset with 3 classes. To address this, I applied the sample_weight array in the XGBClassifier, but I'm not noticing any change in the modelling results: all of the metrics in the classification report (confusion matrix) are the same. Is there an issue with the implementation?

The class ratios:

military: 1171 
government: 34852 
other: 20869

Example:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.utils.class_weight import compute_sample_weight
from xgboost import XGBClassifier

pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=process_text)),  # convert strings to integer token counts
    ('tfidf', TfidfTransformer()),  # convert integer counts to weighted TF-IDF scores
    ('classifier', XGBClassifier(sample_weight=compute_sample_weight(class_weight='balanced', y=y_train)))  # train XGBoost on the TF-IDF vectors
])

Sample of Dataset:

import pandas as pd

data = pd.DataFrame({'entity_name': ['UNICEF', 'US Military', 'Ryan Miller'],
                     'class': ['government', 'military', 'other']})

Classification Report

[classification report image omitted]

Asked By: Riley Hun


Answers:

  1. First, and most important: use a multiclass eval_metric, e.g. eval_metric='merror' or 'mlogloss', then post the results (see the first sketch after this list). You showed us ['precision','recall','f1-score','support'], but those numbers are suboptimal, or outright broken, unless you computed them in a multi-class-aware, imbalance-aware way.
  2. Second, you need weights. Your class ratio military : government : other is roughly 1 : 30 : 18, or as percentages 2% : 61% : 37%. That's severely imbalanced.
  • You can manually set per-sample weights with xgb.DMatrix(..., weight=...); see the second sketch after this list.
  • Look inside your pipeline (use print or verbose settings, dump intermediate values); don't just blindly rely on boilerplate like sklearn.utils.class_weight.compute_sample_weight('balanced', ...) to give you optimal weights.
  • Experiment with manually setting per-class weights, starting with 1 : 1/30 : 1/18, then try more extreme values. Use reciprocals so the rarer classes get higher weights.
  • Also try setting min_child_weight much higher, so that a split requires at least a few exemplars (of the minority classes). Start with min_child_weight >= 2 * (weight of the rarest class) and try going higher. Beware of overfitting to the very rare minority class; this is why people use StratifiedKFold cross-validation for some protection, but your code isn't using CV.
  3. We can't see your other xgboost parameters (how many estimators? early stopping on or off? what learning_rate/eta? etc.). It looks like you used the defaults; they'll be terrible. Or else you're not showing your code. Distrust xgboost's defaults, especially for multiclass; don't expect xgboost to give good out-of-the-box results. Read the docs and experiment with values (see the last sketch after this list).
  4. Do all that experimentation, post your results, and check before concluding "it doesn't work". Don't expect optimal results out-of-the-box, and distrust or double-check the sklearn util functions; try manual alternatives. (Just because sklearn has a function to do something doesn't mean it's good, best, or suitable for all use-cases, like imbalanced multiclass.)
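A minimal sketch of point 1, assuming a reasonably recent xgboost (in older versions eval_metric goes into fit() rather than the constructor); the toy data below merely mimics the 2% / 61% / 37% class split and stands in for the real TF-IDF features:

import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))                              # stand-in for the TF-IDF matrix
y = rng.choice([0, 1, 2], size=2000, p=[0.02, 0.61, 0.37])   # 0=military, 1=government, 2=other

clf = XGBClassifier(
    objective='multi:softprob',   # multiclass objective, per-class probabilities
    eval_metric='mlogloss',       # or 'merror' -- a genuinely multiclass metric
)
clf.fit(X, y, eval_set=[(X, y)], verbose=False)              # track mlogloss while boosting
print(clf.evals_result()['validation_0']['mlogloss'][-1])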
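For point 2, a sketch of manual per-class weights expanded into per-sample weights via the native xgb.DMatrix(..., weight=...) route mentioned above; the 1 : 1/30 : 1/18 values are only the suggested starting point, and the data is again a toy stand-in:

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
y = rng.choice([0, 1, 2], size=2000, p=[0.02, 0.61, 0.37])   # 0=military, 1=government, 2=other

class_weight = {0: 1.0, 1: 1.0 / 30, 2: 1.0 / 18}            # reciprocals of the 1 : 30 : 18 ratio
sample_weight = np.array([class_weight[label] for label in y])

dtrain = xgb.DMatrix(X, label=y, weight=sample_weight)       # per-sample weights go here
params = {'objective': 'multi:softprob', 'num_class': 3, 'eval_metric': 'mlogloss'}
booster = xgb.train(params, dtrain, num_boost_round=100,
                    evals=[(dtrain, 'train')], verbose_eval=False)
# With the sklearn wrapper, the same array goes into fit(), e.g.
# XGBClassifier(...).fit(X, y, sample_weight=sample_weight) -- not into the constructor.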
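And for points 3 and 4, a sketch with the xgboost parameters spelled out (the values are illustrative starting points, not recommendations) plus StratifiedKFold cross-validation for some protection against overfitting the rare class:

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
y = rng.choice([0, 1, 2], size=2000, p=[0.02, 0.61, 0.37])

clf = XGBClassifier(
    objective='multi:softprob',
    eval_metric='mlogloss',
    n_estimators=300,            # number of boosting rounds -- don't leave it to the default
    learning_rate=0.05,          # eta
    max_depth=4,
    min_child_weight=5,          # start around 2 * (weight of the rarest class), then go higher
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # keeps the class mix in every fold
scores = cross_val_score(clf, X, y, cv=cv, scoring='f1_macro')   # a multiclass-aware score
print(scores.mean(), scores.std())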
Answered By: smci