How is Feature Importance calculated in sklearn's RandomForest?

Question:

Starting from this Tutorial and Feature Importance, I am trying to build my own random forest tree:

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df.loc[:, df.columns != 'target']
y = df.loc[:, 'target'].values

X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=0)

# a deliberately tiny forest: a single tree of depth 2
rf = RandomForestClassifier(n_estimators=1,
                            max_depth=2,
                            max_features=2,
                            random_state=0)
rf.fit(X_train, Y_train)
rf.feature_importances_
array([0.        , 0.11197953, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.88802047, 0.        , 0.        , 0.        ])
import matplotlib.pyplot as plt
from sklearn import tree

fn = data.feature_names
cn = data.target_names
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(4, 4), dpi=800)
tree.plot_tree(rf.estimators_[0],
               feature_names=fn,
               class_names=cn,
               filled=True);
fig.savefig('rf_individualtree.png')

(Figure: the single tree of the random forest, plotted above.)

Now I calculate the feature importance by hand from the tree above (sklearn's result: 0.11197953 and 0.88802047):

# impurity decreases read off the plotted tree (values rounded to 3 decimals)
a = (192/265)*(0.262 - (68/192)*0.452 - (124/192)*0.103)
b = (265/265)*(0.459 - (192/265)*0.262 - (73/265)*0.185) + (73/265)*(0.185 - (72/73)*0.173)

print(b/(a+b))
print(a/(a+b))
0.8625754868011606
0.13742451319883947

Which part did I get wrong, so that my result differs from sklearn's answer? Or does sklearn just not follow the formula?

Answers:

You have a couple of problems:

  1. Rounding error
  2. Math, specifically calculating probability of reaching a node

Once you correct them (use the full-precision impurities from tree_.impurity, and weight each node by its bootstrap-weighted sample count, which is 426 at the root, not 265), you'll get sklearn's result:

print(rf.estimators_[0].tree_.impurity)

array([0.45899182, 0.26172737, 0.10250188, 0.45244126, 0.18549346,
       0.17300567, 0.        ])
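
The sample counts used below come from the same tree_ object; a quick way to inspect the per-node arrays (a short sketch using sklearn's public Tree attributes, where node 0 is the root and nodes are stored depth-first):

t = rf.estimators_[0].tree_
print(t.weighted_n_node_samples)  # per-node sample counts: the 426, 310, 116, ... used below
print(t.children_left)            # index of each node's left child, -1 for leaves
print(t.children_right)           # index of each node's right child, -1 for leaves
print(t.feature)                  # feature index each internal node splits on, -2 for leaves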

# weighted impurity decrease of each split, at full precision;
# the root holds 426 (weighted) samples, its children 310 and 116
n1 = 0.45899182261015226 - (310/426)*0.26172736732570234 - (116/426)*0.1854934601664685  # root split
n2 = (116/426)*0.1854934601664685 - (115/426)*0.17300567107750475  # split of the 116-sample node (its other child is pure)
n3 = (310/426)*0.26172736732570234 - (203/426)*0.10250188065713806 - (107/426)*0.45244126124552364  # split of the 310-sample node
f1 = n1 + n2  # these two splits use the same feature
f2 = n3
print(f1/(f1+f2), f2/(f1+f2))

(0.888020474590027, 0.11197952540997297)
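
To confirm this is the general rule rather than a coincidence of this particular tree, here is a minimal sketch (not sklearn's actual Cython source, but the same computation) that accumulates each split's weighted impurity decrease onto the feature it splits on, then normalizes:

import numpy as np

t = rf.estimators_[0].tree_
imp = np.zeros(t.n_features)
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:      # leaf: no split, no contribution
        continue
    n = t.weighted_n_node_samples
    decrease = (n[node]*t.impurity[node]
                - n[left]*t.impurity[left]
                - n[right]*t.impurity[right]) / n[0]
    imp[t.feature[node]] += decrease
imp = imp / imp.sum()   # normalize so the importances sum to 1
print(np.allclose(imp, rf.feature_importances_))  # expected: True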

(You may read more on how importance is calculated here, by the package developers, or here, in the source code.)

Note as well that what a RandomForest considers important may not be so important for another model (and vice versa); i.e., "importance" here is model-specific, and may not be intuitive to people who are more accustomed to linear explainability.
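
For contrast, a model-agnostic alternative is permutation importance, which shuffles one column at a time and measures the drop in held-out score. A minimal sketch using sklearn.inspection.permutation_importance (available in sklearn >= 0.22):

from sklearn.inspection import permutation_importance

# shuffle each feature n_repeats times; record the mean drop in test accuracy
result = permutation_importance(rf, X_test, Y_test, n_repeats=10, random_state=0)
print(result.importances_mean.round(3))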

Answered By: Sergey Bushmanov

Why is the number of samples not equal to the sum of values in each node of your tree? For example, look at the root of the tree: samples=265, but value=[152, 274], and the sum of the elements in value is 426. I believe Sergey Bushmanov gives the correct way to compute the feature importance, but your tree looks incorrect.
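
For reference, both counts can be read off the fitted tree directly; a short sketch (note that with the forest's default bootstrap=True each tree is fit on a weighted bootstrap sample, which can make the plotted samples differ from the sum of value):

t = rf.estimators_[0].tree_
print(t.n_node_samples[0])           # 265: distinct in-bag rows at the root (the plotted "samples")
print(t.weighted_n_node_samples[0])  # 426.0: bootstrap-weighted count, matching 152 + 274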

Answered By: chunyan li