How to get most informative features for scikit-learn classifiers?
The classifiers themselves do not record feature names; they just see numeric arrays. However, if you extracted your features using a CountVectorizer, TfidfVectorizer, or DictVectorizer, and you are using a linear model (e.g. LinearSVC or Naive Bayes), then you can apply the same trick that the document classification example uses. Example (untested, may contain a bug or two):
import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label, " ".join(feature_names[j] for j in top10)))
This is for multiclass classification; for the binary case, I think you should use clf.coef_[0] only. You may also have to sort the class_labels so they match the rows of clf.coef_ (the fitted classifier stores them in sorted order in clf.classes_).
With the help of larsmans' code, I came up with this code for the binary case:
def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))
To add an update, RandomForestClassifier now supports the .feature_importances_ attribute. This attribute tells you the estimated relative importance of each feature, computed from how much each feature reduces impurity across the trees. The values are normalized, so they sum to 1.
I find this attribute very useful when performing feature engineering.
Thanks to the scikit-learn team and contributors for implementing this!
edit: This works for both RandomForest and GradientBoosting, so RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, and GradientBoostingRegressor all support this.