How to find the importance of the features for a logistic regression model?

Question:

I have a binary prediction model trained with the logistic regression algorithm. I want to know which features (predictors) are more important for the decision of the positive or negative class. I know there is a coef_ attribute provided by the scikit-learn package, but I don’t know whether it is enough to measure importance. Another thing is how I can evaluate the coef_ values in terms of importance for the negative and positive classes. I also read about standardized regression coefficients, but I don’t know what they are.

Let’s say there are features like size of tumor, weight of tumor, etc. used to make a decision for a test case, such as malignant or not malignant. I want to know which of the features are more important for the malignant and not-malignant prediction.

Asked By: mgokhanbakal


Answers:

One of the simplest options to get a feeling for the “influence” of a given parameter in a linear classification model (logistic regression being one of those) is to consider the magnitude of its coefficient times the standard deviation of the corresponding parameter in the data.

Consider this example:

import numpy as np    
from sklearn.linear_model import LogisticRegression

x1 = np.random.randn(100)
x2 = 4*np.random.randn(100)
x3 = 0.5*np.random.randn(100)
y = (3 + x1 + x2 + x3 + 0.2*np.random.randn()) > 0
X = np.column_stack([x1, x2, x3])

m = LogisticRegression()
m.fit(X, y)

# The estimated coefficients will all be around 1:
print(m.coef_)

# Those values, however, will show that the second parameter
# is more influential
print(np.std(X, 0)*m.coef_)

An alternative way to get a similar result is to examine the coefficients of the model fit on standardized parameters:

m.fit(X / np.std(X, 0), y)
print(m.coef_)

Note that this is the most basic approach, and a number of other techniques for finding feature importance or parameter influence exist (using p-values, bootstrap scores, various “discriminative indices”, etc.).
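
For example, a rough way to check how stable the coefficient-times-standard-deviation ranking is would be to bootstrap it: refit the model on resampled rows and look at the spread of np.std(X, 0)*coef_ across resamples. The snippet below is only a minimal sketch of that idea (the bootstrap_influence helper, the n_boot value, and the resampling choices are illustrative additions, not part of the original answer):

import numpy as np
from sklearn.linear_model import LogisticRegression

def bootstrap_influence(X, y, n_boot=200, seed=0):
    # bootstrap the std(X) * coef_ "influence" scores
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)            # resample rows with replacement
        if len(np.unique(y[idx])) < 2:              # skip degenerate resamples
            continue
        m = LogisticRegression().fit(X[idx], y[idx])
        scores.append(np.std(X[idx], axis=0) * m.coef_.ravel())
    scores = np.array(scores)
    return scores.mean(axis=0), scores.std(axis=0)  # per-feature mean and spread

means, spreads = bootstrap_influence(X, y)
print(means)    # influence estimates, same feature ordering as above
print(spreads)  # how much each estimate varies across resamples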

I am pretty sure you would get more interesting answers at https://stats.stackexchange.com/.

Answered By: KT.

Since scikit-learn 0.22, sklearn provides the sklearn.inspection module, which implements permutation_importance. It can be used to find the most important features: a higher value indicates higher "importance", meaning the corresponding feature contributes a larger fraction of whatever metric was used to evaluate the model (the default for LogisticRegression is accuracy).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance

# initialize sample (using the same setup as in KT.'s answer)
X = np.random.standard_normal((100,3)) * [1, 4, 0.5]
y = (3 + X.sum(axis=1) + 0.2*np.random.standard_normal()) > 0

# fit a model
model = LogisticRegression().fit(X, y)
# compute importances
model_fi = permutation_importance(model, X, y)
model_fi['importances_mean']                    # array([0.07 , 0.352, 0.02 ])

So in the example above, the most important feature is the second feature, followed by the first and the third. This is the same ordinal ranking as the one suggested in KT.’s post.

One nice thing about permutation_importance is that both training and test datasets may be passed to it to identify which features might cause the model to overfit.
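
As a minimal sketch of that idea (the train_test_split usage and the variable names below are illustrative additions, not part of the original answer), one could compare the importances computed on a training split with those computed on a held-out split:

from sklearn.model_selection import train_test_split

# split the same data into a training set and a held-out set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# importances computed on the data the model was fit on...
train_fi = permutation_importance(model, X_train, y_train)
# ...and on data the model has never seen
test_fi = permutation_importance(model, X_test, y_test)

# a feature that looks important on the training split but not on the
# held-out split is a candidate source of overfitting
print(train_fi['importances_mean'])
print(test_fi['importances_mean'])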


You can read more about it in the documentation, where you can also find an outline of the algorithm.

Answered By: cottontail