XGBoost for multiclass classification and imbalanced data
Question:
I am dealing with a classification problem with 3 classes [0, 1, 2] and an imbalanced class distribution in which class 0 is the heavy majority.
I want to apply XGBClassifier (in Python) to this problem, but the model does not respond to class_weight adjustments: it skews towards the majority class 0 and ignores the minority classes 1 and 2. Which hyperparameters other than class_weight can help me?
I tried:
1) computing class weights using sklearn's compute_class_weight;
2) setting weights according to the relative frequency of the classes;
3) manually setting extreme class weights, such as {0: 0.5, 1: 100, 2: 200}, to see if anything changes at all.
None of these helped the classifier take the minority classes into account.
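For reference, a minimal sketch of the first attempt (assuming y holds the training labels; the variable names are mine):
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

# 'balanced' weights are inversely proportional to class frequencies
classes = np.array([0, 1, 2])
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)

# dict mapping each class label to its weight
class_weight = dict(zip(classes, weights))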
Observations:
- I can handle the problem in the binary case: if I turn it into a binary classification by merging classes [1, 2], I can get the classifier to work properly by adjusting scale_pos_weight (even in this case, class_weight alone does not help). But scale_pos_weight, as far as I know, only works for binary classification. Is there an analogue of this parameter for multiclass problems? (A sketch of this binary workaround follows the list.)
- Using RandomForestClassifier instead of XGBClassifier, I can handle the problem by setting class_weight='balanced_subsample' and tuning max_leaf_nodes. But, for some reason, this approach does not work for XGBClassifier.
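A minimal sketch of the binary workaround from the first observation (assuming y holds the original labels [0, 1, 2]; the negative/positive ratio is a common heuristic, not the only choice):
import numpy as np
from xgboost import XGBClassifier

# merge classes 1 and 2 into a single positive class
y_bin = (np.asarray(y) > 0).astype(int)

# common heuristic: weight positives by the negative/positive count ratio
ratio = np.sum(y_bin == 0) / np.sum(y_bin == 1)

clf = XGBClassifier(scale_pos_weight=ratio)
clf.fit(X, y_bin)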
Remark: I know about balancing techniques such as over/undersampling or SMOTE, but I want to avoid them as much as possible and would prefer a solution based on hyperparameter tuning if one exists.
My observation above shows that this can work in the binary case.
Answers:
The sample_weight parameter is useful for handling imbalanced data when training XGBoost. You can compute the sample weights using sklearn's compute_sample_weight().
This code should work for multiclass data:
from sklearn.utils.class_weight import compute_sample_weight

# one weight per training row, inversely proportional to its class frequency
sample_weights = compute_sample_weight(
    class_weight='balanced',
    y=y  # provide your own target, e.g. train_df['class']
)

xgb_classifier.fit(X, y, sample_weight=sample_weights)
You can use sample_weight as @Prakash Dahal suggested, but compute your own weights. I found that the choice of weights made a dramatic difference (I have 12 classes and very imbalanced data).
If you compute your own weights, you need to assign the relevant weight to each entry and pass the parameter to the classifier in the same way:
xgb_class.fit(X_train, y_train, sample_weight=weights)
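For example, a minimal sketch of turning hand-chosen class weights into per-row sample weights (the weight values below are illustrative placeholders, not tuned values):
import numpy as np

# hypothetical hand-tuned class weights; adjust for your own data
class_weights = {0: 1.0, 1: 8.0, 2: 15.0}

# look up one weight per training row from its class label
weights = np.array([class_weights[label] for label in y_train])

xgb_class.fit(X_train, y_train, sample_weight=weights)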