# How to find the importance of the features for a logistic regression model?

## Question:

I have a binary prediction model trained with a logistic regression algorithm. I want to know which features (predictors) are more important for the decision between the positive and the negative class. I know there is a `coef_` attribute that comes from the scikit-learn package, but I don’t know whether it is enough to judge importance. Another thing is how I can evaluate the `coef_` values in terms of importance for the negative and positive classes. I also read about standardized regression coefficients and I don’t know what they are.

Let's say there are features like size of tumor, weight of tumor, etc. used to decide whether a test case is malignant or not malignant. I want to know which of the features are more important for the malignant and not malignant predictions.

## Answer:

One of the simplest options to get a feeling for the “influence” of a given feature in a linear classification model (logistic regression being one of those) is to consider the magnitude of its coefficient times the standard deviation of the corresponding feature in the data.

Consider this example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Three features with different spreads (standard deviations 1, 4 and 0.5)
x1 = np.random.randn(100)
x2 = 4*np.random.randn(100)
x3 = 0.5*np.random.randn(100)
# Each feature enters the label with weight 1; the noise is drawn per sample
y = (3 + x1 + x2 + x3 + 0.2*np.random.randn(100)) > 0
X = np.column_stack([x1, x2, x3])

m = LogisticRegression()
m.fit(X, y)

# The estimated coefficients should be of roughly similar size
# (each feature enters y with weight 1):
print(m.coef_)

# Those values, however, will show that the second feature
# is more influential
print(np.std(X, 0)*m.coef_)
```

An alternative way to get a similar result is to examine the coefficients of the model fit on standardized features:

```python
m.fit(X / np.std(X, 0), y)
print(m.coef_)
```
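A short derivation may help show why the two computations above give similar numbers. It assumes an unregularized fit and introduces symbols that are not in the original answer: $\beta_j$ for the coefficient of feature $x_j$, $\sigma_j$ for that feature's standard deviation, and $z_j = x_j / \sigma_j$ for the rescaled feature.

$$
\log\frac{P(y=1 \mid x)}{1 - P(y=1 \mid x)}
  = \beta_0 + \sum_j \beta_j x_j
  = \beta_0 + \sum_j (\beta_j \sigma_j)\, z_j
$$

So $\beta_j \sigma_j$ is the coefficient of the rescaled feature, that is, the change in log-odds when feature $j$ moves by one standard deviation. This is one common reading of the "standardized coefficient" mentioned in the question. Because `LogisticRegression` applies L2 regularization by default, refitting on rescaled features gives similar but not identical numbers. The sign gives the direction: a positive coefficient raises the log-odds of the positive class (`m.classes_[1]` in scikit-learn) and a negative one lowers it.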

Note that this is the most basic approach, and a number of other techniques for finding feature importance or parameter influence exist (using p-values, bootstrap scores, various “discriminative indices”, etc.).
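As an illustration of the bootstrap idea mentioned above, here is a minimal sketch, not a definitive recipe. It assumes synthetic data in the spirit of the earlier example (the sample size, seed, and number of resamples are arbitrary choices) and refits the model on resampled rows to see how stable each standardized coefficient is.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: three features with standard deviations 1, 4 and 0.5
X = rng.standard_normal((200, 3)) * [1, 4, 0.5]
y = (3 + X.sum(axis=1) + 0.2 * rng.standard_normal(200)) > 0

# Standardize once so that coefficients are comparable across features
X_std = StandardScaler().fit_transform(X)

coefs = []
for _ in range(200):
    # Draw row indices with replacement (one bootstrap resample)
    idx = rng.integers(0, len(y), size=len(y))
    if np.unique(y[idx]).size < 2:
        continue  # skip a degenerate resample containing a single class
    model = LogisticRegression().fit(X_std[idx], y[idx])
    coefs.append(model.coef_[0])

coefs = np.array(coefs)
mean = coefs.mean(axis=0)
# 95% percentile interval of each coefficient across resamples
lo, hi = np.percentile(coefs, [2.5, 97.5], axis=0)

for j in range(X.shape[1]):
    print(f"feature {j}: mean={mean[j]:.2f}, 95% interval=[{lo[j]:.2f}, {hi[j]:.2f}]")
```

A feature whose interval stays well away from zero across resamples has a stable effect, while an interval that straddles zero suggests the ranking for that feature should not be trusted too much.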

I am pretty sure you would get more interesting answers at https://stats.stackexchange.com/.

## Answer:

Since scikit-learn 0.22, the `sklearn.inspection` module provides `permutation_importance`, which can be used to find the most important features. A higher value indicates higher "importance": shuffling the values of that feature causes a larger drop in whatever metric is used to evaluate the model (by default the estimator's own `score`, which for `LogisticRegression` is accuracy).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance

# initialize sample (using the same setup as in KT.'s answer)
X = np.random.standard_normal((100, 3)) * [1, 4, 0.5]
y = (3 + X.sum(axis=1) + 0.2*np.random.standard_normal(100)) > 0

# fit a model
model = LogisticRegression().fit(X, y)
# compute importances (mean drop in accuracy when each feature is shuffled)
model_fi = permutation_importance(model, X, y)
model_fi['importances_mean']                    # e.g. array([0.07 , 0.352, 0.02 ]); varies between runs
```

So in the example above, the most important feature is the second feature, followed by the first and the third. This is the same ordinal ranking as the one suggested in KT.’s post.

One nice thing about `permutation_importance` is that it can be computed on both the training and the test dataset. A feature that is important on the training set but not on the test set may be causing the model to overfit.
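The train versus test comparison could look roughly like the sketch below. The data, split ratio, seeds, and `n_repeats` value are assumptions chosen for illustration; only `permutation_importance` and `train_test_split` are actual scikit-learn functions.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: three features with standard deviations 1, 4 and 0.5
X = rng.standard_normal((500, 3)) * [1, 4, 0.5]
y = (3 + X.sum(axis=1) + 0.2 * rng.standard_normal(500)) > 0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = LogisticRegression().fit(X_train, y_train)

# Same fitted model, importances measured on two different datasets
fi_train = permutation_importance(model, X_train, y_train, n_repeats=10, random_state=0)
fi_test = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

for j in range(X.shape[1]):
    print(
        f"feature {j}: "
        f"train={fi_train['importances_mean'][j]:.3f}, "
        f"test={fi_test['importances_mean'][j]:.3f}"
    )
```

If a feature scores clearly higher on the training set than on the test set, the model may be relying on it in a way that does not generalize.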

You can read more about it in the documentation, which also includes an outline of the algorithm.
