I was going through sklearn class DecisionTreeClassifier.
Looking at parameters for the class, we have two parameters min_samples_split and min_samples_leaf. Basic idea behind them looks similar, you specify a minimum number of samples required to decide a node to be leaf or split further.
Why do we need two parameters when one implies the other?. Is there any reason or scenario which distinguish them?.
From the documentation:
The main difference between the two is that
min_samples_leafguarantees a minimum number of samples in a leaf, while
min_samples_splitcan create arbitrary small leaves, though
min_samples_splitis more common in the literature.
To get a grasp of this piece of documentation I think you should make the distinction between a leaf (also called external node) and an internal node. An internal node will have further splits (also called children), while a leaf is by definition a node without any children (without any further splits).
min_samples_split specifies the minimum number of samples required to split an internal node, while
min_samples_leaf specifies the minimum number of samples required to be at a leaf node.
For instance, if
min_samples_split = 5, and there are 7 samples at an internal node, then the split is allowed. But let’s say the split results in two leaves, one with 1 sample, and another with 6 samples. If
min_samples_leaf = 2, then the split won’t be allowed (even if the internal node has 7 samples) because one of the leaves resulted will have less then the minimum number of samples required to be at a leaf node.
As the documentation referenced above mentions,
min_samples_leaf guarantees a minimum number of samples in every leaf, no matter the value of
In decision trees, there are many rules one can set up to configure how the tree should end up. Roughly, there are more ‘design’ oriented rules like max_depth. Max_depth is more like when you build a house, the architect asks you how many floors you want on the house.
Some other rules are ‘defensive’ rules. We often call them stopping rules. min_samples_leaf and min_samples_split belong to this type. All explanations already provided are very well said. My cent: rules interact when the tree is being built. For example, min_samples_leaf=100, you may very well end up with tree where all the terminal nodes are way larger than 100 because others rule kick in to have stopped the tree from expanding.
Both parameters will produce similar results, the difference is the point of view.
The min_samples_split parameter will evaluate the number of samples in the node, and if the number is less than the minimum the split will be avoided and the node will be a leaf.
The min_samples_leaf parameter checks before the node is generated, that is, if the possible split results in a child with fewer samples, the split will be avoided (since the minimum number of samples for the child to be a leaf has not been reached) and the node will be replaced by a leaf.
In all cases, when we have samples with more than one Class in a leaf, the Final Class will be the most likely to happen, according to the samples that reached it in training.
min_sample_split tells above the minimum no. of samples reqd. to split an internal node. If an integer value is taken then consider min_samples_split as the minimum no. If float, then it shows the percentage. By default, it takes “2” value.
min_sample_leaf is the minimum number of samples required to be at a leaf node. If an integer value is taken then consider – -min_samples_leaf as the minimum no. If float, then it shows the percentage. By default, it takes “1” value.
lets say that min_samples_split = 9 and min_samples_leaf =3 .
in the internal node, left split is allowed (10>9) and right split not allowed (3<9).
but because min_samples_leaf =3 and one leaf is 2 (the right one) so it will not split
Look at the leaf with the number 3 (from the first splitting).
If we decide that mim_samples_leaf =4 and not 3 so even the first splitting would not be happen (13 to 10 and 3) .