imbalanced-learn: how is the threshold calculated in the instance hardness threshold method?

Question:

I am looking at the source code of the InstanceHardnessThreshold transformer from imbalanced-learn, here: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/12b2e0d/imblearn/under_sampling/_prototype_selection/_instance_hardness_threshold.py#L167

I am wondering how exactly the threshold is calculated, and what the rationale behind it is.

Asked By: Sole Galli


Answers:

After discussing with the maintainers of the imbalanced-learn package, this is what I learned:

The threshold is determined as follows:

threshold = np.percentile(
    probabilities[y == target_class],
    (1.0 - (n_samples / target_stats[target_class])) * 100.0,
)

where n_samples is the number of majority-class samples desired in the final dataset, and target_stats[target_class] is the total number of majority-class samples present in the original dataset.

We need to find a probability threshold such that the number of samples at or above that threshold matches the number of samples requested in sampling_strategy. By default, that is the number of samples in the minority class, unless the user specifies otherwise.
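To make the calculation concrete, here is a minimal sketch using NumPy only. The data is made up for illustration: the probabilities are random stand-ins for an estimator's output, and the `>=` comparison used to pick the samples is my assumption about how the threshold is applied, not a copy of the library code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 100 majority samples (class 0), 20 minority samples (class 1).
y = np.array([0] * 100 + [1] * 20)

# Stand-in for the estimator's predicted probability of each sample's own class.
probabilities = rng.uniform(size=y.shape[0])

target_class = 0
target_stats = {0: 100, 1: 20}  # class counts in the original dataset
n_samples = 20                  # majority samples we want to keep

# Keeping 20 out of 100 means cutting at the 80th percentile.
percentile = (1.0 - n_samples / target_stats[target_class]) * 100.0
threshold = np.percentile(probabilities[y == target_class], percentile)

# Assumed selection rule: keep samples whose probability reaches the threshold.
kept = np.flatnonzero(probabilities[y == target_class] >= threshold)
print(percentile, threshold, kept.size)
```

When all probabilities are distinct, the interpolated percentile falls strictly between two sorted values, so the number of samples kept equals n_samples.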

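Instance hardness is the probability of an observation being misclassified. In other words, it is 1 minus the probability that the estimator assigns to the observation's true class, so keeping the samples with the highest class probability is the same as keeping those with the lowest hardness.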
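As a small illustration of that definition, the sketch below derives the hardness from a predict_proba-style matrix. The matrix and labels are invented for the example, and the variable names (proba, p_true, hardness) are mine, not the library's.

```python
import numpy as np

# Hypothetical predict_proba-style output: one row per sample, one column per class.
proba = np.array([
    [0.9, 0.1],
    [0.6, 0.4],
    [0.3, 0.7],
    [0.2, 0.8],
])
y = np.array([0, 0, 0, 1])  # true labels

# Probability the estimator assigns to each sample's true class.
p_true = proba[np.arange(y.shape[0]), y]

# Instance hardness: probability of being misclassified.
hardness = 1.0 - p_true

# Ranking by decreasing class probability matches ranking by increasing hardness.
print(np.argsort(-p_true), np.argsort(hardness))
```

The third sample belongs to class 0 but only gets a probability of 0.3 for it, so it is the hardest instance and would be the first candidate for removal.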

The idea is that the probabilities given by the estimator reflect how certain it is that a sample belongs to its class. Therefore, a percentile of 0 would mean that we select all samples, while a percentile of 100 (np.percentile works on a 0 to 100 scale) would mean that we select a single sample (the one with the maximum probability). So the threshold corresponds to selecting the N samples that the estimator considers most certain to belong to class C. N is defined by the sampling_strategy parameter (e.g., the expected balancing ratio).

This method may return more observations than the user requested, for example when several samples share the same probability as the threshold. This is mentioned in the documentation.
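One way this can happen is with tied probabilities. The sketch below uses made-up probabilities for the majority class and assumes that samples are kept when their probability is greater than or equal to the threshold; treat that comparison as an assumption rather than the library's exact code.

```python
import numpy as np

# Hypothetical class probabilities for 10 majority samples; four are tied at 0.9.
majority_proba = np.array([0.9, 0.9, 0.9, 0.9, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])

n_samples = 2  # majority samples requested
n_majority = majority_proba.shape[0]

percentile = (1.0 - n_samples / n_majority) * 100.0
threshold = np.percentile(majority_proba, percentile)

# Assumed selection rule: keep every sample at or above the threshold.
kept = np.flatnonzero(majority_proba >= threshold)
print(threshold, kept.size)
```

Here the threshold lands on the tied value 0.9, so all four tied samples pass the cut even though only two were requested.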

Answered By: Sole Galli