How do I store a TfidfVectorizer for future use in scikit-learn?
You can simply use the built-in pickle library:
    import pickle
    pickle.dump(vectorizer, open("vectorizer.pickle", "wb"))
    pickle.dump(selector, open("selector.pickle", "wb"))
and load it with:
    vectorizer = pickle.load(open("vectorizer.pickle", "rb"))
    selector = pickle.load(open("selector.pickle", "rb"))
Pickle serializes the objects to disk and loads them back into memory when you need them.
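The calls above hand open(...) straight to pickle, so the files are only closed once the file objects are garbage-collected. A minimal sketch of the same idea with explicit with blocks follows. The helper names save_object and load_object are hypothetical, and a plain dict stands in for the fitted vectorizer so the snippet runs without scikit-learn.

```python
import os
import pickle
import tempfile


def save_object(obj, path):
    """Pickle obj to path; the with block closes the file even on error."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_object(path):
    """Load a pickled object back from path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for a fitted vectorizer (assumption: any picklable object
# is saved and loaded the same way).
fake_vectorizer = {"vocabulary": {"hello": 0, "world": 1}}

path = os.path.join(tempfile.mkdtemp(), "vectorizer.pickle")
save_object(fake_vectorizer, path)
restored = load_object(path)
```

As with any pickle file, only load files you created yourself or otherwise trust, since unpickling can execute arbitrary code.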
Here is my answer using joblib:
    import joblib
    joblib.dump(vectorizer, 'vectorizer.pkl')
    joblib.dump(selector, 'selector.pkl')
Later, I can load them and be ready to go:
    vectorizer = joblib.load('vectorizer.pkl')
    selector = joblib.load('selector.pkl')
    test = selector.transform(vectorizer.transform(['this is test']))
"Making an object persistent" basically means that you write a serialized representation of the in-memory object to a file on disk, so that later on, in the same program or in any other program, the object can be reloaded from that file into memory.
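To make that concrete, here is a minimal sketch using only the standard library. Pickle produces a byte string describing the object's state (not a raw copy of memory), and loading it builds a new, independent object. A types.SimpleNamespace is used as a stand-in for a fitted estimator; that is an assumption made so the snippet runs without scikit-learn.

```python
import pickle
from types import SimpleNamespace

# Stand-in for a fitted estimator (assumption): an object holding learned state.
original = SimpleNamespace(vocabulary_={"hello": 0, "world": 1}, max_features=None)

payload = pickle.dumps(original)   # serialize the object's state to bytes
clone = pickle.loads(payload)      # rebuild an independent object from those bytes

print(type(payload).__name__)                      # bytes
print(clone is original)                           # False
print(clone.vocabulary_ == original.vocabulary_)   # True
```

Writing payload to a file and reading it back later is all that pickle.dump and pickle.load add on top of this.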
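Either joblib, which is installed alongside scikit-learn, or the stdlib pickle and cPickle would do the job. I tend to prefer cPickle because it is significantly faster (cPickle exists only in Python 2; in Python 3, pickle uses the C implementation automatically). Using IPython's %timeit command: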
    >>> import pickle, cPickle, joblib
    >>> from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
    >>> t = TFIDF()
    >>> t.fit_transform(['hello world'], ['this is a test'])  # second argument is y, which is ignored
    # generic serializer - deserializer test (files opened in binary mode)
    >>> def dump_load_test(tfidf, serializer):
    ...:     with open('vectorizer.bin', 'wb') as f:
    ...:         serializer.dump(tfidf, f)
    ...:     with open('vectorizer.bin', 'rb') as f:
    ...:         return serializer.load(f)
    # joblib has a slightly different interface
    >>> def joblib_test(tfidf):
    ...:     joblib.dump(tfidf, 'tfidf.bin')
    ...:     return joblib.load('tfidf.bin')
    # Now, time it!
    >>> %timeit joblib_test(t)
    100 loops, best of 3: 3.09 ms per loop
    >>> %timeit dump_load_test(t, pickle)
    100 loops, best of 3: 2.16 ms per loop
    >>> %timeit dump_load_test(t, cPickle)
    1000 loops, best of 3: 879 µs per loop
Now if you want to store multiple objects in a single file, you can easily create a data structure to store them, then dump the data structure itself. This will work with tuple, list or dict. From the example of your question:
    import cPickle  # Python 2; on Python 3 use "import pickle as cPickle"
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    # train
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(corpus)
    selector = SelectKBest(chi2, k=5000)
    X_train_sel = selector.fit_transform(X_train, y_train)
    # dump as a dict
    data_struct = {'vectorizer': vectorizer, 'selector': selector}
    # use the 'with' keyword to automatically close the file after the dump
    with open('storage.bin', 'wb') as f:
        cPickle.dump(data_struct, f)
Later or in another program, the following statements will bring back the data structure in your program's memory:
    # reload
    with open('storage.bin', 'rb') as f:
        data_struct = cPickle.load(f)
        vectorizer, selector = data_struct['vectorizer'], data_struct['selector']
    # do stuff...
    vectors = vectorizer.transform(...)
    vec_sel = selector.transform(vectors)
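The same pattern works with a tuple instead of a dict. The sketch below is written for Python 3 with the stdlib pickle. The two plain dicts are stand-ins for the fitted vectorizer and selector, an assumption made so it runs without scikit-learn or a training corpus.

```python
import os
import pickle
import tempfile

# Stand-ins for the fitted vectorizer and selector (assumption: plain dicts).
vectorizer = {"kind": "vectorizer", "vocabulary": {"this": 0, "test": 1}}
selector = {"kind": "selector", "k": 5000}

path = os.path.join(tempfile.mkdtemp(), "storage.bin")

# Dump both objects as a single tuple.
with open(path, "wb") as f:
    pickle.dump((vectorizer, selector), f)

# Reload and unpack in the same order the objects were stored.
with open(path, "rb") as f:
    loaded_vectorizer, loaded_selector = pickle.load(f)
```

A tuple relies on position, so the load side has to unpack in the order used at dump time. A dict keyed by name tends to be more forgiving if you later add more objects to the file.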