Random forest class_weight and sample_weight parameters
Question:
I have a class imbalance problem and have been experimenting with a weighted Random Forest using the implementation in scikit-learn (>= 0.16).
I have noticed that the implementation takes a class_weight parameter in the tree constructor and a sample_weight parameter in the fit method to help solve class imbalance. The two seem to be multiplied together, though, to decide a final weight.
I have trouble understanding the following:
- In what stages of the tree construction/training/prediction are those weights used? I have seen some papers for weighted trees, but I am not sure what scikit implements.
- What exactly is the difference between class_weight and sample_weight?
Answers:
RandomForests are built on Trees, which are very well documented. Check how Trees use the sample weighting:
- User guide on decision trees – tells exactly what algorithm is used
- Decision tree API – explains how sample_weight is used by trees (which for random forests, as you have determined, is the product of class_weight and sample_weight).
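To make the two entry points concrete, here is a minimal sketch (my own illustration, not from the original answer) of where each parameter goes in the scikit-learn API: class_weight is set in the constructor, sample_weight is passed to fit.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical imbalanced toy data: ~90% class 0, ~10% class 1.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

# class_weight is given at construction time, mapping class label -> weight.
clf = RandomForestClassifier(n_estimators=100,
                             class_weight={0: 1.0, 1: 9.0},
                             random_state=0)

# sample_weight is given at fit time, one weight per training example.
sample_weight = np.ones(len(y))
clf.fit(X, y, sample_weight=sample_weight)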
As for the difference between class_weight and sample_weight: much can be determined simply from their datatypes. sample_weight is a 1D array of length n_samples, assigning an explicit weight to each example used for training. class_weight is either a dictionary mapping each class to a uniform weight for that class (e.g., {1: 0.9, 2: 0.5, 3: 0.01}), or a string telling sklearn how to determine this dictionary automatically.
So the training weight for a given example is the product of its explicitly given sample_weight (or 1 if sample_weight is not provided) and its class_weight (or 1 if class_weight is not provided).
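A rough illustration of that product (my own sketch, not scikit-learn source code): expand the class_weight dictionary to one weight per sample, then multiply element-wise with sample_weight.

import numpy as np

y = np.array([1, 1, 2, 3, 3, 3])
class_weight = {1: 0.9, 2: 0.5, 3: 0.01}        # dict from the example above
sample_weight = np.array([1, 2, 1, 1, 1, 4.0])  # explicit per-example weights

# Expand class weights to per-sample form, defaulting to 1 for any class
# not listed, then multiply with the explicit per-example weights.
per_class = np.array([class_weight.get(label, 1.0) for label in y])
effective_weight = sample_weight * per_class
print(effective_weight)  # [0.9  1.8  0.5  0.01 0.01 0.04]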