How does scikit-learn DecisionTreeClassifier handle duplicate values when determining potential split points for a continuous predictor variable?

Question:

Suppose I have a continuous predictor variable with values of 10, 20, 20, 30. I understand that the set of potential split thresholds would include {15, 25}, as these are the means of 10 & 20 and of 20 & 30, respectively. But would 20 also be included as a potential split threshold because it is the mean of 20 & 20, or do repeated values in the sorted array get skipped?

Note that I’m not asking about the metric used to select the best split threshold (gini, entropy, log-loss, etc.). I’m asking about the upstream process of identifying the potential thresholds that will be evaluated with this metric.

My coding skills aren’t strong enough to understand the scikit-learn source code, but I think this information might be found here. I cannot find anything in the documentation itself about this though.

Asked By: NaiveBae


Answers:

No: in your example, 20 is not considered as a candidate split point. Since splits take the form f_i <= threshold vs. f_i > threshold, a threshold of 20 and a threshold of 25 would produce exactly the same partition of your data anyway.
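You can check this empirically. The sketch below (my own test, not part of scikit-learn) fits a tree on your four values under every nontrivial binary labeling and collects all thresholds that ever appear at an internal node; 20 should never show up, only the midpoints 15 and 25:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[10.0], [20.0], [20.0], [30.0]])

seen = set()
for bits in range(1, 15):  # every nontrivial binary labeling of 4 samples
    y = [(bits >> i) & 1 for i in range(4)]
    clf = DecisionTreeClassifier(random_state=0).fit(X, y)
    t = clf.tree_.threshold
    # leaves are marked with threshold -2 (TREE_UNDEFINED); keep internal nodes
    seen.update(t[t != -2].tolist())

print(sorted(seen))  # only 15.0 and 25.0, never 20.0
```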

In the code that you linked (I’m looking at BestSplitter), after sorting the feature values, it loops through the indices p, but skips over those with equal values:

                while p + 1 < end and Xf[p + 1] <= Xf[p] + FEATURE_THRESHOLD:
                    p += 1

[source] (FEATURE_THRESHOLD is very small and handles precision issues)
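Translated out of Cython, the candidate-generation logic amounts to something like the following sketch (the function name and the exact FEATURE_THRESHOLD value here are illustrative, not scikit-learn's API):

```python
import numpy as np

FEATURE_THRESHOLD = 1e-7  # tiny tolerance to absorb floating-point noise


def candidate_thresholds(values):
    """Sketch of how BestSplitter enumerates thresholds for one feature:
    sort the values, skip runs of (nearly) equal values, take midpoints."""
    xf = np.sort(np.asarray(values, dtype=float))
    end = len(xf)
    thresholds = []
    p = 0
    while p < end - 1:
        # skip over duplicates, exactly like the loop quoted above
        while p + 1 < end and xf[p + 1] <= xf[p] + FEATURE_THRESHOLD:
            p += 1
        if p + 1 < end:
            thresholds.append((xf[p] + xf[p + 1]) / 2.0)
        p += 1
    return thresholds


print(candidate_thresholds([10, 20, 20, 30]))  # [15.0, 25.0]
```

Because the inner loop jumps past the repeated 20, the pair (20, 20) never produces a midpoint, so only 15 and 25 remain as candidates.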

Answered By: Ben Reiniger