XGBoost for multiclassification and imbalanced data

Question:

I am dealing with a classification problem with 3 classes [0,1,2], and imbalanced class distribution as shown below.

enter image description here

I want to apply XGBClassifier (in Python) to this classification problem, but the model does not respond to class_weight adjustments and skews towards the majority class 0, and ignores the minority classes 1,2. Which hyperparameters other than class_weight can help me?

I tried 1) computing class weights using sklearn compute_class_weight; 2) setting weights according to the relative frequency of the classes; 3) and also manually adjusting classes with extreme values to see if any change happens at all, such as {0:0.5,1:100,2:200}. But in any case, it does not help the classifier to take the minority classes into account.

Observations:

  • I can handle the problem in the binary case: If I make the problem a binary classification by identifying classes [1,2], then I can get the classifier work properly by adjusting scale_pos_weight (even in this case class_weight alone does not help).
    But scale_pos_weight, as far as I know, works for binary classification. Is there an analogue of this parameter for the multi-classification problems?

  • Using RandomForestClassifier instead of XGBClassifier, I can handle the problem by setting class_weight='balanced_subsample' and tunning max_leaf_nodes. But, for some reason, this approach does not work for XGBClassifier.

Remark: I know about balancing techniques, such as over/undersampling, or SMOTE. But I want to avoid them as much as possible, and prefer a solutions using hyperparameter tunning of the model if possible.
My observation above shows that this can work for the binary case.

Asked By: Pooyan Moradifar

||

Answers:

sample_weight parameter is useful for handling imbalanced data while using XGBoost for training the data. You can compute sample weights by using compute_sample_weight() of sklearn library.

This code should work for multiclass data:

from sklearn.utils.class_weight import compute_sample_weight
sample_weights = compute_sample_weight(
    class_weight='balanced',
    y=train_df['class'] #provide your own target name
)

xgb_classifier.fit(X, y, sample_weight=sample_weights)
Answered By: Prakash Dahal

You can use sample_weight as @Prakash Dahal suggested, but compute your own weights. I found that different weights made a dramatic difference (I have 12 classes and very imbalanced data).
If you compute your own weights, you need to assign the relevant weight to each entry and pass the param to the classifier in the same way:
xgb_class.fit(X_train, y_train, sample_weight=weights)

Answered By: orly064