How do I store a TfidfVectorizer for future use in scikit-learn?

You can simply use the built-in pickle module:

    import pickle

    with open("vectorizer.pickle", "wb") as f:
        pickle.dump(vectorizer, f)
    with open("selector.pickle", "wb") as f:
        pickle.dump(selector, f)

and load it with:

    with open("vectorizer.pickle", "rb") as f:
        vectorizer = pickle.load(f)
    with open("selector.pickle", "rb") as f:
        selector = pickle.load(f)

Pickle serializes the objects to disk and loads them back into memory when you need them.

See the Python documentation for the pickle module for details.

Here is my answer using joblib:

    import joblib

    joblib.dump(vectorizer, 'vectorizer.pkl')
    joblib.dump(selector, 'selector.pkl')

Later, I can load them and they are ready to go:

    vectorizer = joblib.load('vectorizer.pkl')
    selector = joblib.load('selector.pkl')
    test = selector.transform(vectorizer.transform(['this is test']))

"Making an object persistent" basically means that you serialize the object's in-memory state to a file on the hard drive, so that later on, in the same program or in any other program, the object can be reloaded from that file into memory.
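To make that idea concrete, here is a minimal sketch using only the standard library. The `fitted_state` dict is a hypothetical stand-in for what a fitted vectorizer has learned (it is not a scikit-learn object); the point is only that serializing produces a sequence of bytes, and loading those bytes gives back an equal but separate object. For instances of a class, pickle stores the instance's attributes plus a reference to the class, so the class itself must be importable when you load.

```python
import pickle

# Hypothetical stand-in for the learned state of a fitted vectorizer.
fitted_state = {
    "vocabulary_": {"hello": 0, "test": 1, "world": 2},
    "idf_": [1.4, 1.4, 1.0],
}

# Serialize to bytes in memory; pickle.dump(obj, f) writes the same bytes to a file.
payload = pickle.dumps(fitted_state)

# Rebuild an equivalent object from the bytes.
restored = pickle.loads(payload)

is_bytes = isinstance(payload, bytes)
is_equal = restored == fitted_state       # same content
is_same_object = restored is fitted_state  # but a new object in memory
```

Writing `payload` to a file and reading it back later is all that `pickle.dump` and `pickle.load` add on top of this.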

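Either joblib (which older scikit-learn versions bundled, and which is now a standalone package) or the stdlib pickle and cPickle would do the job. I tend to prefer cPickle because it is significantly faster. Note that cPickle exists only on Python 2; on Python 3 the plain pickle module already uses the C implementation when it is available. Using IPython's %timeit command: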

    >>> import pickle, cPickle, joblib  # cPickle exists on Python 2 only
    >>> from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
    >>> t = TFIDF()
    >>> t.fit_transform(['hello world', 'this is a test'])

    # generic serializer - deserializer test
    >>> def dump_load_test(tfidf, serializer):
    ...     # binary mode is required on Python 3 and safe everywhere
    ...     with open('vectorizer.bin', 'wb') as f:
    ...         serializer.dump(tfidf, f)
    ...     with open('vectorizer.bin', 'rb') as f:
    ...         return serializer.load(f)

    # joblib has a slightly different interface
    >>> def joblib_test(tfidf):
    ...     joblib.dump(tfidf, 'tfidf.bin')
    ...     return joblib.load('tfidf.bin')

    # Now, time it!
    >>> %timeit joblib_test(t)
    100 loops, best of 3: 3.09 ms per loop
    >>> %timeit dump_load_test(t, pickle)
    100 loops, best of 3: 2.16 ms per loop
    >>> %timeit dump_load_test(t, cPickle)
    1000 loops, best of 3: 879 µs per loop
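If you want to repeat this kind of measurement outside IPython, the stdlib `timeit` module does roughly the same job. The sketch below is only an illustration under stated assumptions: `state` is a hypothetical stand-in for a fitted vectorizer's vocabulary (a plain dict, not a scikit-learn object), and instead of comparing pickle with cPickle it compares pickle protocols, which is the speed knob still available on Python 3. Timings depend on your machine, so none are shown here.

```python
import pickle
import timeit

# Hypothetical stand-in for a fitted vectorizer's state: a 20,000-term vocabulary.
state = {"vocabulary_": {"term%d" % i: i for i in range(20000)}}


def round_trip(protocol):
    # Serialize to bytes and load back, entirely in memory (no disk I/O).
    return pickle.loads(pickle.dumps(state, protocol=protocol))


timings = {}
for protocol in (0, 2, pickle.HIGHEST_PROTOCOL):
    # Best of 3 repeats, each repeat doing 5 round trips; store seconds per round trip.
    best = min(timeit.repeat(lambda: round_trip(protocol), number=5, repeat=3))
    timings[protocol] = best / 5

for protocol, seconds in sorted(timings.items()):
    print("protocol %d: %.2f ms per round trip" % (protocol, seconds * 1e3))
```

To time a real vectorizer, replace `state` with your fitted object; the rest of the sketch stays the same.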

Now if you want to store multiple objects in a single file, you can easily create a data structure to store them, then dump the data structure itself. This will work with a tuple, list or dict. From the example in your question:

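    try:
        import cPickle as pickle  # Python 2
    except ImportError:
        import pickle  # Python 3 (already uses the C implementation)

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    # train (corpus and y_train come from the question)
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(corpus)
    selector = SelectKBest(chi2, k=5000)
    X_train_sel = selector.fit_transform(X_train, y_train)

    # dump as a dict
    data_struct = {'vectorizer': vectorizer, 'selector': selector}

    # use the 'with' keyword to automatically close the file after the dump
    with open('storage.bin', 'wb') as f:
        pickle.dump(data_struct, f)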

Later or in another program, the following statements will bring back the data structure in your program's memory:

    import pickle  # on Python 2, cPickle is the faster drop-in replacement

    # reload
    with open('storage.bin', 'rb') as f:
        data_struct = pickle.load(f)
    vectorizer, selector = data_struct['vectorizer'], data_struct['selector']

    # do stuff...
    vectors = vectorizer.transform(...)
    vec_sel = selector.transform(vectors)
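The same pattern works with a tuple instead of a dict, as mentioned above. Here is a small self-contained sketch of that variant; `vectorizer_state` and `selector_state` are hypothetical placeholders (plain dicts) standing in for the fitted objects, and the file goes into a temporary directory so the example does not touch your working directory.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the two fitted objects (plain dicts, not estimators).
vectorizer_state = {"vocabulary_": {"hello": 0, "world": 1}}
selector_state = {"k": 1, "selected": [0]}

path = os.path.join(tempfile.mkdtemp(), "storage.bin")

# Dump both objects as one tuple; the order matters when unpacking later.
with open(path, "wb") as f:
    pickle.dump((vectorizer_state, selector_state), f)

# Reload and unpack in the same order.
with open(path, "rb") as f:
    loaded_vectorizer, loaded_selector = pickle.load(f)
```

A dict is a bit more forgiving than a tuple here, because you look objects up by name rather than by position.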