How to get most informative features for scikit-learn classifiers?
The classifiers themselves do not record feature names; they just see numeric arrays. However, if you extracted your features using a CountVectorizer, TfidfVectorizer, or DictVectorizer, and you are using a linear model (e.g. LinearSVC or Naive Bayes), then you can apply the same trick that the document classification example uses. Example (untested, may contain a bug or two):
import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label, " ".join(feature_names[j] for j in top10)))
This is for multiclass classification; for the binary case, I think you should use clf.coef_[0] only. You may also have to sort the class_labels so they match the rows of clf.coef_ (the fitted classifier stores them in sorted order in clf.classes_).
With the help of larsmans' code, I came up with this code for the binary case:
def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))
To add an update, RandomForestClassifier now supports the .feature_importances_ attribute. This attribute tells you the estimated relative importance of each feature, computed from how much each feature reduces impurity across the trees. The values are normalized, so they sum to 1.
I find this attribute very useful when performing feature engineering.
Thanks to the scikit-learn team and contributors for implementing this!
edit: This works for both RandomForest and GradientBoosting, so RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, and GradientBoostingRegressor all support this.