
Can I use CountVectorizer in scikit-learn to count frequency of documents that were not used to extract the tokens?


You're right that vocabulary is what you want. It works like this:

>>> import sklearn.feature_extraction.text
>>> cv = sklearn.feature_extraction.text.CountVectorizer(vocabulary=['hot', 'cold', 'old'])
>>> cv.fit_transform(['pease porridge hot', 'pease porridge cold', 'pease porridge in the pot', 'nine days old']).toarray()
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 1]], dtype=int64)

So you pass it either an iterable of your desired features, as above, or a dict with the features as keys and their column indices as values.
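For example, here is a minimal sketch of the dict form (the term-to-index mapping is made up, but it is equivalent to the list above):

from sklearn.feature_extraction.text import CountVectorizer

# dict form: keys are the terms, values are their column indices
cv = CountVectorizer(vocabulary={'hot': 0, 'cold': 1, 'old': 2})
X = cv.fit_transform(['pease porridge hot', 'nine days old'])
print(X.toarray())   # [[1 0 0]
                     #  [0 0 1]]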

If you used CountVectorizer on one set of documents and then you want to use the set of features from those documents for a new set, use the vocabulary_ attribute of your original CountVectorizer and pass it to the new one. So in your example, you could do

newVec = CountVectorizer(vocabulary=vec.vocabulary_)

to create a new vectorizer that uses the vocabulary from your first one.
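For example, a minimal sketch of that round trip (the corpora here are made up, and vec stands in for your original vectorizer):

from sklearn.feature_extraction.text import CountVectorizer

# fit on the original documents to learn a vocabulary
vec = CountVectorizer()
vec.fit(['pease porridge hot', 'pease porridge cold'])

# reuse that vocabulary for a second vectorizer
newVec = CountVectorizer(vocabulary=vec.vocabulary_)
X_new = newVec.fit_transform(['porridge is cold today'])

print(sorted(vec.vocabulary_))   # ['cold', 'hot', 'pease', 'porridge']
print(X_new.toarray())           # [[1 0 0 1]]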


You should call fit_transform or just fit on your original vocabulary source so that the vectorizer learns a vocabulary.

Then you can apply this fitted vectorizer to any new data source via the transform() method.

You can obtain the vocabulary produced by the fit (i.e. the mapping of word to token ID) via vectorizer.vocabulary_ (assuming you named your CountVectorizer vectorizer).
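For example, a minimal sketch of that workflow (the documents and the vectorizer name are made up):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(['the cat sat', 'the dog sat'])        # learns the vocabulary

# apply the same columns to unseen documents
counts = vectorizer.transform(['the cat and the dog'])

print(vectorizer.vocabulary_)   # e.g. {'the': 3, 'cat': 0, 'sat': 2, 'dog': 1}
print(counts.toarray())         # [[1 1 0 2]]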


>>> tags = [
...     "python, tools",
...     "linux, tools, ubuntu",
...     "distributed systems, linux, networking, tools",
... ]
>>> list_of_new_documents = [
...     ["python, chicken"],
...     ["linux, cow, ubuntu"],
...     ["machine learning, bird, fish, pig"],
... ]
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vect = CountVectorizer()
>>> tags = vect.fit_transform(tags)

# vocabulary learned by CountVectorizer (vect)
>>> print(vect.vocabulary_)
{'python': 3, 'tools': 5, 'linux': 1, 'ubuntu': 6, 'distributed': 0, 'systems': 4, 'networking': 2}

# counts for tags
>>> tags.toarray()
array([[0, 0, 0, 1, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 1],
       [1, 1, 1, 0, 1, 1, 0]], dtype=int64)

# to use `transform`, `list_of_new_documents` should be a flat list of strings;
# `itertools.chain.from_iterable` flattens the nested list
>>> from itertools import chain
>>> new_docs = list(chain.from_iterable(list_of_new_documents))
>>> new_docs = vect.transform(new_docs)

# finally, counts for new_docs!
>>> new_docs.toarray()
array([[0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0]])

To verify that CountVectorizer is applying the vocabulary learned from tags to new_docs, print vect.vocabulary_ again, or compare the column layout of new_docs.toarray() with that of tags.toarray().
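If it helps, here is a small sketch of that check against the session above (get_feature_names_out needs scikit-learn >= 1.0; older releases have get_feature_names instead):

# the learned mapping is not modified by transform()
vocab_before = dict(vect.vocabulary_)
_ = vect.transform(["python, tools, something unseen"])
assert vect.vocabulary_ == vocab_before

# column order matches the vocabulary learned from `tags`
print(vect.get_feature_names_out())
# ['distributed' 'linux' 'networking' 'python' 'systems' 'tools' 'ubuntu']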