How to see top n entries of term-document matrix after tfidf in scikit-learn

python numpy scikit-learn tf-idf top-n

Since version 0.15, the global term weighting of the features learned by a TfidfVectorizer can be accessed through the attribute idf_, which returns an array whose length equals the number of features. Because idf_ is a corpus-wide weight, its highest values belong to the terms that occur in the fewest documents. Sort the features by this weight to get the top-weighted features:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

lectures = ["this is some food", "this is some drink"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(lectures)

# Indices of the features, highest idf first (stable sort keeps ties reproducible)
indices = np.argsort(vectorizer.idf_, kind="stable")[::-1]
# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
features = vectorizer.get_feature_names_out()

top_n = 2
top_features = [features[i] for i in indices[:top_n]]
print(top_features)

Output:

['food', 'drink']

The second problem, getting the top features for each n-gram size, can be solved with the same idea plus an extra step that splits the features into groups by the number of words they contain:

from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer

lectures = ["this is some food", "this is some drink"]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(lectures)

# Group (feature, idf) pairs by the number of words in the feature
features_by_gram = defaultdict(list)
for f, w in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    features_by_gram[len(f.split(' '))].append((f, w))

top_n = 2
for gram, features in features_by_gram.items():
    # sorted() is stable, so features with equal idf stay in vocabulary order
    top_features = sorted(features, key=lambda x: x[1], reverse=True)[:top_n]
    top_features = [f[0] for f in top_features]
    print('{}-gram top:'.format(gram), top_features)

Output:

1-gram top: ['drink', 'food']
2-gram top: ['some drink', 'some food']
