How to get most informative features for scikit-learn classifiers?
Question:
The classifiers in machine learning packages like liblinear and nltk offer a method show_most_informative_features(), which is really helpful for debugging features:
viagra = None ok : spam = 4.5 : 1.0
hello = True ok : spam = 4.5 : 1.0
hello = None spam : ok = 3.3 : 1.0
viagra = True spam : ok = 3.3 : 1.0
casino = True spam : ok = 2.0 : 1.0
casino = None ok : spam = 1.5 : 1.0
My question is whether something similar is implemented for the classifiers in scikit-learn. I searched the documentation but couldn't find anything like it.
If there is no such function yet, does somebody know a workaround for getting those values?
Answers:
The classifiers themselves do not record feature names; they just see numeric arrays. However, if you extracted your features using a Vectorizer (CountVectorizer, TfidfVectorizer, or DictVectorizer), and you are using a linear model (e.g. LinearSVC or naive Bayes), then you can apply the same trick that the document classification example uses. Example (untested, may contain a bug or two):
import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class."""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))
This is for multiclass classification; for the binary case, I think you should use clf.coef_[0] only. You may have to sort the class_labels.
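For reference, a minimal usage sketch for the multiclass case (texts and y are hypothetical variables holding raw documents and their labels):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)   # texts: list of raw documents (assumed)
clf = LinearSVC().fit(X, y)           # y: class labels (assumed)

# clf.classes_ is already in the same order as the rows of clf.coef_,
# so passing it avoids having to sort the labels yourself.
print_top10(vectorizer, clf, clf.classes_)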
With the help of larsmans' code I came up with this code for the binary case:
def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    # Pair each coefficient with its feature name, sorted from most negative to most positive
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))
RandomForestClassifier does not yet have a coef_ attribute, but it will in the 0.17 release, I think. However, see the RandomForestClassifierWithCoef class in Recursive feature elimination on Random Forest using scikit-learn. This may give you some ideas to work around the limitation above.
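The idea behind that class is roughly the following (a sketch, not the exact code from the linked question): subclass RandomForestClassifier and copy feature_importances_ into a coef_ attribute after fitting, so utilities that expect coef_ (such as RFE in older scikit-learn versions) can work with the forest:

from sklearn.ensemble import RandomForestClassifier

class RandomForestClassifierWithCoef(RandomForestClassifier):
    """Random forest that exposes feature_importances_ as coef_ after fitting."""
    def fit(self, *args, **kwargs):
        super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
        self.coef_ = self.feature_importances_
        return self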
You can also do something like this to create a graph of feature importances, ordered from most to least important:
import numpy as np
import matplotlib.pyplot as plt

# clf is a fitted forest; train[features] is the training data
importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(train[features].shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(train[features].shape[1]), indices)
plt.xlim([-1, train[features].shape[1]])
plt.show()
To add an update, RandomForestClassifier now supports the .feature_importances_ attribute. This attribute tells you how much each feature contributes to the model (an impurity-based importance); the values are normalized, so they sum to 1.
I find this attribute very useful when performing feature engineering.
Thanks to the scikit-learn team and contributors for implementing this!
edit: This works for both random forests and gradient boosting, so RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier and GradientBoostingRegressor all support this.
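For example, the attribute is available on any of those estimators right after fitting (a minimal sketch on synthetic data):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# One importance value per feature; the values sum to 1
print(model.feature_importances_)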
We've recently released a library (https://github.com/TeamHG-Memex/eli5) which allows you to do that: it handles various classifiers from scikit-learn, binary/multiclass cases, allows highlighting text according to feature values, integrates with IPython, etc.
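Roughly, the usage looks like this (a sketch; check the eli5 docs for the exact API, and assume clf and vectorizer are a fitted classifier and its vectorizer):

import eli5

# In a Jupyter notebook: render the top weighted features per class as HTML
eli5.show_weights(clf, vec=vectorizer, top=20)

# Plain-text alternative outside of notebooks
print(eli5.format_as_text(eli5.explain_weights(clf, vec=vectorizer, top=20)))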
I actually had to find out feature importance for my NaiveBayes classifier, and although I used the above functions, I was not able to get feature importance per class. I went through scikit-learn's documentation and tweaked the above functions a bit to get them working for my problem. Hope it helps you too!
def important_features(vectorizer, classifier, n=20):
    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names()
    # feature_count_ holds per-class counts of each feature seen during fitting
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names), reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names), reverse=True)[:n]

    print("Important words in negative reviews")
    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)

    print("-----------------------------------------")

    print("Important words in positive reviews")
    for coef, feat in topn_class2:
        print(class_labels[1], coef, feat)
Note that your classifier (in my case it's NaiveBayes) must have the feature_count_ attribute for this to work.
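For example, a hypothetical setup with a CountVectorizer and a MultinomialNB (which does expose feature_count_), where reviews and sentiments are assumed to be raw texts and binary labels:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)
classifier = MultinomialNB().fit(X, sentiments)

important_features(vectorizer, classifier, n=20)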
Not exactly what you are looking for, but a quick way to get the largest magnitude coefficients (assuming your feature names are the columns of a pandas dataframe df). Say you trained the model like this:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

lr = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(df, Y, test_size=0.25)
lr.fit(X_train, y_train)
Get the 10 largest negative coefficient values (or change to reverse=True for the largest positive) like:
sorted(list(zip(df.columns, lr.coef_)), key=lambda x: x[1],
       reverse=False)[:10]
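If you instead want the largest coefficients by absolute magnitude regardless of sign, sort on the absolute value (a small variation on the snippet above):
sorted(zip(df.columns, lr.coef_), key=lambda x: abs(x[1]), reverse=True)[:10]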
First, make a list of feature names; I call this list label, and fill it by collecting all the feature/column names. Here I use a naive Bayes model, where feature_log_prob_ gives the (log) probability of each feature per class.
def top20(model, label):
    # feature_log_prob_ holds the log probability of each feature given the class
    feature_prob = model.feature_log_prob_
    for i in range(len(feature_prob)):
        print('top 20 features for class {}'.format(i))
        clas = feature_prob[i, :]
        dictionary = {}
        for count, ele in enumerate(clas, 0):
            dictionary[count] = ele
        # Highest log probability first, i.e. the most probable features for this class
        dictionary = dict(sorted(dictionary.items(), key=lambda x: x[1], reverse=True)[:20])
        keys = list(dictionary.keys())
        for k in keys:
            print(label[k])
        print('*' * 100)
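For example, with a CountVectorizer and a MultinomialNB (a hypothetical setup where texts and y are assumed to exist), the label list is simply the vectorizer's feature names:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = MultinomialNB().fit(X, y)

label = vectorizer.get_feature_names()   # feature names, indexed by column position
top20(model, label)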