
Keep TFIDF result for predicting new content using Scikit for Python


I managed to save the feature list by pickling vectorizer.vocabulary_ and reusing it later via CountVectorizer(decode_error="replace", vocabulary=vectorizer.vocabulary_).

Code below:

    import pickle
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    corpus = np.array(["aaa bbb ccc", "aaa bbb ddd"])
    vectorizer = CountVectorizer(decode_error="replace")
    vec_train = vectorizer.fit_transform(corpus)

    # Save vectorizer.vocabulary_
    with open("feature.pkl", "wb") as f:
        pickle.dump(vectorizer.vocabulary_, f)

    # Load it later
    transformer = TfidfTransformer()
    with open("feature.pkl", "rb") as f:
        loaded_vec = CountVectorizer(decode_error="replace", vocabulary=pickle.load(f))

    # Only the vocabulary is reused; the IDF weights are refit on the new documents
    tfidf = transformer.fit_transform(loaded_vec.fit_transform(np.array(["aaa ccc eee"])))

That works. tfidf will have the same number of features (columns) as the training data.
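As a quick check of that claim, here is a minimal sketch using the same toy corpus. It assumes the default tokenizer settings and skips the pickle round trip, passing the vocabulary dict directly. The token "eee" was never seen in training, so it is dropped.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = np.array(["aaa bbb ccc", "aaa bbb ddd"])
vectorizer = CountVectorizer(decode_error="replace")
vec_train = vectorizer.fit_transform(corpus)

# Rebuild a vectorizer from the saved vocabulary (a dict: term -> column index).
loaded_vec = CountVectorizer(decode_error="replace",
                             vocabulary=vectorizer.vocabulary_)

# With a fixed vocabulary, transform() needs no prior fit.
vec_new = loaded_vec.transform(np.array(["aaa ccc eee"]))

print(vec_train.shape)    # (2, 4)
print(vec_new.shape)      # (1, 4)
print(vec_new.toarray())  # [[1 0 1 0]] -> columns are aaa, bbb, ccc, ddd
```

The column order follows the stored vocabulary, which is why the new row lines up with the training matrix.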


Instead of using CountVectorizer to store the vocabulary, you can use the vocabulary of the TfidfVectorizer directly.

Training phase:

    import pickle
    from sklearn.feature_extraction.text import TfidfVectorizer

    # corpus and new_corpus are assumed to be iterables of document strings

    # tf-idf based vectors
    tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words="english",
                         lowercase=True, max_features=500000)

    # Fit the model
    tf_transformer = tf.fit(corpus)

    # Dump the file
    with open("tfidf1.pkl", "wb") as f:
        pickle.dump(tf_transformer, f)

    # Testing phase
    with open("tfidf1.pkl", "rb") as f:
        tf1 = pickle.load(f)

    # Create new TfidfVectorizer with old vocabulary
    tf1_new = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words="english",
                              lowercase=True, max_features=500000,
                              vocabulary=tf1.vocabulary_)
    X_tf1 = tf1_new.fit_transform(new_corpus)

fit_transform works here because we pass in the old vocabulary, so the output has the same columns as the training features. Be aware that fit_transform also recomputes the IDF weights from the new documents. If you want to keep the IDF weights learned during training as well, skip the new vectorizer and call transform on the loaded one, i.e. tf1.transform(new_corpus). The pickled vectorizer stores both the vocabulary and the fitted IDF weights, and transform maps the new documents onto that vocabulary without refitting anything.
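To make the difference between refitting and plain transform concrete, here is a small sketch. The toy documents and variable names are made up for illustration, and the pickle round trip is done in memory rather than through a file. It compares the IDF weights of a reloaded vectorizer with those of a new vectorizer refit on a fixed vocabulary.

```python
import pickle
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["aaa bbb ccc", "aaa bbb ddd"]  # hypothetical training corpus
new_docs = ["aaa ccc eee"]                   # hypothetical unseen documents

tf = TfidfVectorizer().fit(train_docs)

# Round-trip through pickle (in memory here; a file works the same way).
tf_loaded = pickle.loads(pickle.dumps(tf))

# transform() reuses both the vocabulary and the IDF weights from training.
X_reused = tf_loaded.transform(new_docs)

# Refitting with a fixed vocabulary keeps the columns but recomputes
# the IDF weights from new_docs alone.
tf_refit = TfidfVectorizer(vocabulary=tf.vocabulary_)
X_refit = tf_refit.fit_transform(new_docs)

same_shape = X_reused.shape == X_refit.shape
same_idf = np.allclose(tf_loaded.idf_, tf.idf_)
refit_idf_differs = not np.allclose(tf_refit.idf_, tf.idf_)
print(same_shape, same_idf, refit_idf_differs)
```

Both matrices have the same columns, so either works as classifier input in terms of shape. If the model was trained on features weighted with the training IDF, the transform route is the one that keeps new inputs on the same scale.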


If you want to store the computed features (the TF-IDF matrix) for future use, you can do this:

    tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))

    # Store the content
    with open("x_result.pkl", "wb") as handle:
        pickle.dump(tfidf, handle)

    # Load the content
    with open("x_result.pkl", "rb") as handle:
        tfidf = pickle.load(handle)