Save and reuse TfidfVectorizer in scikit-learn
Firstly, it's better to leave the import at the top of your code instead of within your class:
    from sklearn.feature_extraction.text import TfidfVectorizer

    class changeToMatrix(object):
        def __init__(self, ngram_range=(1, 1), tokenizer=StemTokenizer()):
            ...
Next, StemTokenizer doesn't seem to be a canonical class. You may have taken it from http://sahandsaba.com/visualizing-philosophers-and-scientists-by-the-words-they-used-with-d3js-and-python.html or somewhere else, so we'll assume it returns a list of strings:
    class StemTokenizer(object):
        def __init__(self):
            self.ignore_set = {'footnote', 'nietzsche', 'plato', 'mr.'}

        def __call__(self, doc):
            words = []
            for word in word_tokenize(doc):
                word = word.lower()
                w = wn.morphy(word)
                if w and len(w) > 1 and w not in self.ignore_set:
                    words.append(w)
            return words
Now, to answer your actual question: you may need to open the file in byte mode ('wb') before dumping the pickle, i.e.:
    >>> from sklearn.feature_extraction.text import TfidfVectorizer
    >>> from nltk import word_tokenize
    >>> import cPickle as pickle
    >>> vectorizer = TfidfVectorizer(ngram_range=(0,2), analyzer='word', lowercase=True,
    ...                              token_pattern='[a-zA-Z0-9]+', strip_accents='unicode',
    ...                              tokenizer=word_tokenize)
    >>> vectorizer
    TfidfVectorizer(analyzer='word', binary=False, decode_error=u'strict',
            dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
            lowercase=True, max_df=1.0, max_features=None, min_df=1,
            ngram_range=(0, 2), norm=u'l2', preprocessor=None, smooth_idf=True,
            stop_words=None, strip_accents='unicode', sublinear_tf=False,
            token_pattern='[a-zA-Z0-9]+',
            tokenizer=<function word_tokenize at 0x7f5ea68e88c0>, use_idf=True,
            vocabulary=None)
    >>> with open('vectorizer.pk', 'wb') as fin:
    ...     pickle.dump(vectorizer, fin)
    ...
    >>> exit()
    alvas@ubi:~$ ls -lah vectorizer.pk
    -rw-rw-r-- 1 alvas alvas 763 Jun 15 14:18 vectorizer.pk
Note: using the with idiom for file I/O closes the file automatically once you exit the with scope.
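To reuse the vectorizer later, load it back with pickle.load. A minimal sketch of the round trip (on Python 3, plain `import pickle` replaces cPickle; the filename and parameters here are just placeholders):

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# dump as above...
vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True)
with open('vectorizer.pk', 'wb') as fout:
    pickle.dump(vectorizer, fout)

# ...then, in a later session, load it back;
# the constructor parameters survive the round trip
with open('vectorizer.pk', 'rb') as fin:
    loaded = pickle.load(fin)

print(loaded.ngram_range)  # → (1, 2)
```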
Regarding the issue with SnowballStemmer(), note that SnowballStemmer('english') is an object, while the stemming function is SnowballStemmer('english').stem.
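To make the distinction concrete (a small sketch; 'running' is just a sample word):

```python
from nltk.stem import SnowballStemmer

stemmer_obj = SnowballStemmer('english')  # an object, not callable on text
stem = stemmer_obj.stem                   # the actual string -> string function

print(stem('running'))  # → 'run'
```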
IMPORTANT: TfidfVectorizer's tokenizer parameter expects a callable that takes a string and returns a list of strings, but the Snowball stemmer's .stem takes a single word and returns a single stemmed word, so it cannot be passed in directly.
So you will need to do this:
    >>> from nltk.stem import SnowballStemmer
    >>> from nltk import word_tokenize
    >>> stemmer = SnowballStemmer('english').stem
    >>> def stem_tokenize(text):
    ...     return [stemmer(i) for i in word_tokenize(text)]
    ...
    >>> vectorizer = TfidfVectorizer(ngram_range=(0,2), analyzer='word', lowercase=True,
    ...                              token_pattern='[a-zA-Z0-9]+', strip_accents='unicode',
    ...                              tokenizer=stem_tokenize)
    >>> with open('vectorizer.pk', 'wb') as fin:
    ...     pickle.dump(vectorizer, fin)
    ...
    >>> exit()
    alvas@ubi:~$ ls -lah vectorizer.pk
    -rw-rw-r-- 1 alvas alvas 758 Jun 15 15:55 vectorizer.pk
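Once the vectorizer is fitted, pickling it also preserves the learned vocabulary, so you can transform new documents later without refitting. A sketch of the full round trip — note that the tokenizer must be a module-level function for pickle to serialize a reference to it (a lambda would fail to pickle). Here str.split stands in for word_tokenize just to keep the sketch self-contained, and the toy corpus is an assumption:

```python
import pickle

from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = SnowballStemmer('english').stem

def stem_tokenize(text):
    # module-level function: pickle stores a reference to it;
    # str.split() stands in for nltk's word_tokenize in this sketch
    return [stemmer(token) for token in text.split()]

docs = ['the cats are running', 'a cat ran']  # toy corpus
vectorizer = TfidfVectorizer(tokenizer=stem_tokenize)
vectorizer.fit_transform(docs)

with open('vectorizer.pk', 'wb') as fout:
    pickle.dump(vectorizer, fout)

with open('vectorizer.pk', 'rb') as fin:
    reloaded = pickle.load(fin)

# the fitted vocabulary survives, so transform() works directly
print(reloaded.transform(['running cat']).shape)
```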