tf-idf feature weights using sklearn.feature_extraction.text.TfidfVectorizer

python scikit-learn tf-idf

Since version 0.15, the IDF weight of each feature (the inverse document frequency learned from the corpus, not a per-document tf-idf score) can be retrieved via the attribute idf_ of the TfidfVectorizer object:

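    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["This is very strange",
              "This is very nice"]

    vectorizer = TfidfVectorizer(min_df=1)
    X = vectorizer.fit_transform(corpus)

    # One IDF weight per vocabulary term, in the same order as the feature names.
    idf = vectorizer.idf_

    # Note: newer scikit-learn releases replace get_feature_names()
    # with get_feature_names_out().
    print(dict(zip(vectorizer.get_feature_names(), idf)))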

Output:

{u'is': 1.0, u'nice': 1.4054651081081644, u'strange': 1.4054651081081644, u'this': 1.0, u'very': 1.0}
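
For reference, the values above are consistent with the smoothed IDF formula that scikit-learn documents for its default smooth_idf=True setting, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing term t. Here is a minimal pure-Python sketch, assuming lowercasing and whitespace splitting in place of the vectorizer's real tokenizer (the helper name smoothed_idf is made up for illustration and is not part of scikit-learn):

```python
import math


def smoothed_idf(corpus):
    """Return {term: idf} using idf = ln((1 + n) / (1 + df)) + 1."""
    # Simplified tokenization (assumption): lowercase, split on whitespace.
    docs = [set(doc.lower().split()) for doc in corpus]
    n = len(docs)
    vocab = sorted(set().union(*docs))
    idf = {}
    for term in vocab:
        # Document frequency: number of documents that contain the term.
        df = sum(term in doc for doc in docs)
        idf[term] = math.log((1 + n) / (1 + df)) + 1
    return idf


corpus = ["This is very strange", "This is very nice"]
idf = smoothed_idf(corpus)
for term, weight in idf.items():
    print(term, round(weight, 6))
```

Terms that occur in both documents get ln(3/3) + 1 = 1.0, while "nice" and "strange" occur in one document each and get ln(3/2) + 1.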

As discussed in the comments, prior to version 0.15, a workaround is to access the attribute idf_ via the supposedly hidden _tfidf (an instance of TfidfTransformer) of the vectorizer:

    idf = vectorizer._tfidf.idf_
    print(dict(zip(vectorizer.get_feature_names(), idf)))

which should give the same output as above.


To get the TF-IDF values of the terms in each document, read the nonzero entries of the corresponding row of X:

    feature_names = vectorizer.get_feature_names()
    doc = 0
    # Column indices of the terms that actually occur in this document.
    feature_index = X[doc, :].nonzero()[1]
    tfidf_scores = zip(feature_index, [X[doc, x] for x in feature_index])
    for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
        print("%s %s" % (w, s))

Output:

    this 0.448320873199
    is 0.448320873199
    very 0.448320873199
    strange 0.630099344518

and for doc = 1:

    this 0.448320873199
    is 0.448320873199
    very 0.448320873199
    nice 0.630099344518

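The results are L2-normalized per document (TfidfVectorizer uses norm='l2' by default), so the squared scores of each document sum to 1:

    >>> 0.448320873199**2 + 0.448320873199**2 + 0.448320873199**2 + 0.630099344518**2
    0.9999999999997548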
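
The per-document scores can also be reproduced by hand. The sketch below assumes each term occurs exactly once in the document (true for this corpus), takes the IDF values printed earlier as the raw tf-idf weights, and divides by the Euclidean norm. l2_normalize is an illustrative helper, not a scikit-learn function.

```python
import math


def l2_normalize(weights):
    """Scale a {term: weight} mapping so the squared weights sum to 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


# Raw tf-idf for document 0: term frequency 1 times the IDF of each term.
raw = {"this": 1.0, "is": 1.0, "very": 1.0, "strange": 1.4054651081081644}
scores = l2_normalize(raw)
for term, score in scores.items():
    print(term, round(score, 6))
```

The three shared terms come out at about 0.448321 and "strange" at about 0.630099, which agrees with the scores listed above for doc 0.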
