Save and reuse TfidfVectorizer in scikit-learn
Firstly, it's better to leave the import at the top of your code instead of within your class:
    from sklearn.feature_extraction.text import TfidfVectorizer

    class changeToMatrix(object):
        def __init__(self, ngram_range=(1, 1), tokenizer=StemTokenizer()):
            ...
Next, StemTokenizer doesn't seem to be a canonical class. You may have taken it from http://sahandsaba.com/visualizing-philosophers-and-scientists-by-the-words-they-used-with-d3js-and-python.html or somewhere else, so we'll assume it returns a list of strings:
    class StemTokenizer(object):
        def __init__(self):
            self.ignore_set = {'footnote', 'nietzsche', 'plato', 'mr.'}

        def __call__(self, doc):
            words = []
            for word in word_tokenize(doc):
                word = word.lower()
                w = wn.morphy(word)
                if w and len(w) > 1 and w not in self.ignore_set:
                    words.append(w)
            return words
Now, to answer your actual question: you may need to open the file in byte mode ('wb') before dumping the pickle, i.e.:
    >>> from sklearn.feature_extraction.text import TfidfVectorizer
    >>> from nltk import word_tokenize
    >>> import cPickle as pickle
    >>> vectorizer = TfidfVectorizer(ngram_range=(0,2), analyzer='word', lowercase=True,
    ...                              token_pattern='[a-zA-Z0-9]+', strip_accents='unicode',
    ...                              tokenizer=word_tokenize)
    >>> vectorizer
    TfidfVectorizer(analyzer='word', binary=False, decode_error=u'strict',
            dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
            lowercase=True, max_df=1.0, max_features=None, min_df=1,
            ngram_range=(0, 2), norm=u'l2', preprocessor=None, smooth_idf=True,
            stop_words=None, strip_accents='unicode', sublinear_tf=False,
            token_pattern='[a-zA-Z0-9]+',
            tokenizer=<function word_tokenize at 0x7f5ea68e88c0>, use_idf=True,
            vocabulary=None)
    >>> with open('vectorizer.pk', 'wb') as fin:
    ...     pickle.dump(vectorizer, fin)
    ...
    >>> exit()
    alvas@ubi:~$ ls -lah vectorizer.pk
    -rw-rw-r-- 1 alvas alvas 763 Jun 15 14:18 vectorizer.pk
Note: using the with idiom for file I/O closes the file automatically once you exit the with scope.
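To reuse the vectorizer later, load it back with pickle.load. A minimal sketch of the round trip (on Python 3, plain `import pickle` replaces cPickle; the filename and parameters here are just placeholders):

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# dump as above...
vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True)
with open('vectorizer.pk', 'wb') as fout:
    pickle.dump(vectorizer, fout)

# ...then, in a later session, load it back;
# the constructor parameters survive the round trip
with open('vectorizer.pk', 'rb') as fin:
    loaded = pickle.load(fin)

print(loaded.ngram_range)  # → (1, 2)
```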
Regarding the issue with SnowballStemmer(), note that SnowballStemmer('english') is an object, while the stemming function is SnowballStemmer('english').stem.
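To make the distinction concrete (a small sketch; 'running' is just a sample word):

```python
from nltk.stem import SnowballStemmer

stemmer_obj = SnowballStemmer('english')  # an object, not callable on text
stem = stemmer_obj.stem                   # the actual string -> string function

print(stem('running'))  # → 'run'
```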
IMPORTANT: TfidfVectorizer's tokenizer parameter expects a callable that takes a string and returns a list of strings, but the Snowball stemmer's .stem takes a single word and returns a single stemmed word, so it cannot be passed in directly.
So you will need to do this:
    >>> from nltk.stem import SnowballStemmer
    >>> from nltk import word_tokenize
    >>> stemmer = SnowballStemmer('english').stem
    >>> def stem_tokenize(text):
    ...     return [stemmer(i) for i in word_tokenize(text)]
    ...
    >>> vectorizer = TfidfVectorizer(ngram_range=(0,2), analyzer='word', lowercase=True,
    ...                              token_pattern='[a-zA-Z0-9]+', strip_accents='unicode',
    ...                              tokenizer=stem_tokenize)
    >>> with open('vectorizer.pk', 'wb') as fin:
    ...     pickle.dump(vectorizer, fin)
    ...
    >>> exit()
    alvas@ubi:~$ ls -lah vectorizer.pk
    -rw-rw-r-- 1 alvas alvas 758 Jun 15 15:55 vectorizer.pk
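Once the vectorizer is fitted, pickling it also preserves the learned vocabulary, so you can transform new documents later without refitting. A sketch of the full round trip — note that the tokenizer must be a module-level function for pickle to serialize a reference to it (a lambda would fail to pickle). Here str.split stands in for word_tokenize just to keep the sketch self-contained, and the toy corpus is an assumption:

```python
import pickle

from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = SnowballStemmer('english').stem

def stem_tokenize(text):
    # module-level function: pickle stores a reference to it;
    # str.split() stands in for nltk's word_tokenize in this sketch
    return [stemmer(token) for token in text.split()]

docs = ['the cats are running', 'a cat ran']  # toy corpus
vectorizer = TfidfVectorizer(tokenizer=stem_tokenize)
vectorizer.fit_transform(docs)

with open('vectorizer.pk', 'wb') as fout:
    pickle.dump(vectorizer, fout)

with open('vectorizer.pk', 'rb') as fin:
    reloaded = pickle.load(fin)

# the fitted vocabulary survives, so transform() works directly
print(reloaded.transform(['running cat']).shape)
```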