Keep TFIDF result for predicting new content using Scikit for Python

python machine-learning scikit-learn tf-idf

I successfully saved the feature list by pickling vectorizer.vocabulary_, and I reuse it later with CountVectorizer(decode_error="replace", vocabulary=vectorizer.vocabulary_).

Code below:

    import pickle

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    corpus = np.array(["aaa bbb ccc", "aaa bbb ddd"])
    vectorizer = CountVectorizer(decode_error="replace")
    vec_train = vectorizer.fit_transform(corpus)

    # Save vectorizer.vocabulary_
    with open("feature.pkl", "wb") as f:
        pickle.dump(vectorizer.vocabulary_, f)

    # Load it later
    transformer = TfidfTransformer()
    with open("feature.pkl", "rb") as f:
        loaded_vec = CountVectorizer(decode_error="replace", vocabulary=pickle.load(f))
    tfidf = transformer.fit_transform(loaded_vec.fit_transform(np.array(["aaa ccc eee"])))

That works: tfidf has the same number of features as the training data.
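One caveat with the snippet above: calling fit_transform on a fresh TfidfTransformer recomputes the IDF weights from the new documents alone. A minimal sketch, assuming you want new documents weighted with the training IDF, is to persist the fitted transformer alongside the vocabulary and call transform at prediction time. The in-memory pickle.dumps / pickle.loads round trip and the dictionary keys are illustrative choices that stand in for the file used above.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = np.array(["aaa bbb ccc", "aaa bbb ddd"])

# Training: learn the vocabulary and the IDF weights.
vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
tfidf_train = transformer.fit_transform(vectorizer.fit_transform(corpus))

# Serialize both pieces (bytes here; writing to a file works the same way).
saved = pickle.dumps({"vocabulary": vectorizer.vocabulary_,
                      "transformer": transformer})

# Later: rebuild the vectorizer from the vocabulary and reuse the fitted transformer.
state = pickle.loads(saved)
loaded_vec = CountVectorizer(decode_error="replace",
                             vocabulary=state["vocabulary"])
tfidf_new = state["transformer"].transform(
    loaded_vec.transform(np.array(["aaa ccc eee"])))

# Both matrices have one column per training term; "eee" is ignored.
print(tfidf_train.shape, tfidf_new.shape)
```

Because the vocabulary is fixed, the unseen term "eee" contributes no column, and the two matrices line up feature for feature.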

Instead of using CountVectorizer to store the vocabulary, you can use the vocabulary of the TfidfVectorizer directly.

Training phase:

    import pickle

    from sklearn.feature_extraction.text import TfidfVectorizer

    # tf-idf based vectors (corpus is the training text, defined elsewhere)
    tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words="english",
                         lowercase=True, max_features=500000)

    # Fit the model
    tf_transformer = tf.fit(corpus)

    # Dump the file
    pickle.dump(tf_transformer, open("tfidf1.pkl", "wb"))

    # Testing phase
    tf1 = pickle.load(open("tfidf1.pkl", 'rb'))

    # Create new TfidfVectorizer with old vocabulary
    tf1_new = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words="english",
                              lowercase=True, max_features=500000,
                              vocabulary=tf1.vocabulary_)
    X_tf1 = tf1_new.fit_transform(new_corpus)

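fit_transform works here because the vocabulary is fixed to the one learned during training, so the output has the same columns as the training matrix. If you had not stored the vectorizer, you would simply have called transform on the test data. Even then, the new documents are mapped onto the vocabulary of the vectorizer fitted on the training data, which is what happens here too. Note that fit_transform recomputes the IDF weights from the new documents, so only the vocabulary is carried over from training. Since the pickled object is the whole fitted vectorizer, calling tf1.transform(new_corpus) directly would reuse the training IDF weights as well.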
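As an alternative sketch, the pickled object can be loaded and applied with transform directly, without building a second vectorizer. The toy corpus and new_corpus below are made-up stand-ins for the real data, and the pickle round trip is done in memory rather than through tfidf1.pkl.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the real training and test corpora.
corpus = ["the cat sat on the mat", "the dog chased the cat"]
new_corpus = ["a cat chased a bird"]

tf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                     stop_words="english", lowercase=True)
tf.fit(corpus)

# Round-trip the whole fitted vectorizer (bytes here; a file works the same way).
tf1 = pickle.loads(pickle.dumps(tf))

# transform maps documents onto the training vocabulary and IDF weights;
# terms unseen in training (such as "bird") are dropped.
X_train = tf1.transform(corpus)
X_new = tf1.transform(new_corpus)

# Same number of columns for training and new data.
print(X_train.shape[1] == X_new.shape[1])
```

This keeps the training and prediction features on the same scale, since both use the IDF weights learned from the training corpus.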

To store the resulting features for later use, for example at test time, you can do this:

    tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))

    # store the content
    with open("x_result.pkl", 'wb') as handle:
        pickle.dump(tfidf, handle)

    # load the content
    tfidf = pickle.load(open("x_result.pkl", "rb"))
