How to balance classification using DecisionTreeClassifier?
Question:
I have a data set where the classes are unbalanced. The classes are either 0, 1, or 2.
How can I calculate the prediction error for each class and then re-balance the weights accordingly in scikit-learn?
Answers:
If you want to fully balance (treat each class as equally important) you can simply pass class_weight='balanced', as stated in the docs:
The “balanced” mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data
as n_samples / (n_classes * np.bincount(y))
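As a sketch of what that formula produces, on made-up toy data (the labels and counts below are invented for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy imbalanced data: class 0 is rare, classes 1 and 2 dominate
X = np.array([[i] for i in range(100)])
y = np.array([0] * 10 + [1] * 45 + [2] * 45)

# 'balanced' reweights each class by n_samples / (n_classes * np.bincount(y))
clf = DecisionTreeClassifier(class_weight="balanced", random_state=0)
clf.fit(X, y)

# The effective weights the formula produces: the rare class gets the
# largest weight, so its samples count more during tree construction
weights = len(y) / (len(np.unique(y)) * np.bincount(y))
print(weights)
```

The rare class 0 receives weight 100 / (3 * 10) ≈ 3.33, while classes 1 and 2 each get about 0.74.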
If the frequency of class A is 10% and the frequency of class B is 90%, then class B will become the dominant class and your decision tree will be biased toward it.
In this case, you can pass a dict such as {A: 9, B: 1} to the model to specify the weight of each class, like
clf = tree.DecisionTreeClassifier(class_weight={A: 9, B: 1})
(here A and B stand in for the actual class labels, e.g. 0 and 1).
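A runnable sketch of the dictionary form, assuming hypothetical numeric labels 0 and 1 standing in for A and B:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical encoding: class A -> label 0 (10% of data),
# class B -> label 1 (90% of data)
X = np.array([[i] for i in range(100)])
y = np.array([0] * 10 + [1] * 90)

# Give the rare class nine times the weight of the dominant class
clf = DecisionTreeClassifier(class_weight={0: 9, 1: 1}, random_state=0)
clf.fit(X, y)
print(clf.get_params()["class_weight"])
```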
class_weight='balanced' will also work; it just automatically adjusts the weights in inverse proportion to each class's frequency.
After I used class_weight='balanced', the weighted record count of each class became the same (around 88923).
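As for the question's other half, measuring the prediction error for each class, one common sketch uses a confusion matrix (the labels and predictions below are made up):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for classes 0, 1, 2
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 2, 0, 2])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# Per-class recall = diagonal / row sum; per-class error is its complement
per_class_error = 1 - cm.diagonal() / cm.sum(axis=1)
print(per_class_error)
```

The resulting errors could then guide a class_weight dict, giving larger weights to classes with larger errors.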
You can use class_weight, but it doesn't seem to handle heavily unbalanced classes very well. There are other methods:
I’m using binary classification as an example here…
Class 0 (Under-represented): Num records x
Class 1 (Over-represented): Num records y
Oversampling: if there are x records of the under-represented class and y records of the over-represented class, you take all y records plus the x records repeated roughly y/x times, so both classes end up with about y records.
Undersampling: if there are x records of the under-represented class and y records of the over-represented class, you take all x records plus an x-sized sample of the over-represented class.
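The two sampling schemes above can be sketched with plain NumPy index arrays (the class counts are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 100)  # class 0 under-represented
idx0, idx1 = np.where(y == 0)[0], np.where(y == 1)[0]

# Oversampling: keep all majority records, and sample the minority
# with replacement until it matches the majority count
over = np.concatenate([idx1, rng.choice(idx0, size=len(idx1), replace=True)])

# Undersampling: keep all minority records, and draw a minority-sized
# sample (without replacement) from the majority
under = np.concatenate([idx0, rng.choice(idx1, size=len(idx0), replace=False)])

print(len(over), len(under))
```

Both index arrays are then used to select rows of X and y before fitting the classifier.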
There’s also SMOTE, which attempts to create synthetic records for the under-represented class: https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html