
Can I use CountVectorizer in scikit-learn to count frequency of documents that were not used to extract the tokens?


You're right that vocabulary is what you want. It works like this:

>>> import sklearn.feature_extraction.text
>>> cv = sklearn.feature_extraction.text.CountVectorizer(vocabulary=['hot', 'cold', 'old'])
>>> cv.fit_transform(['pease porridge hot', 'pease porridge cold', 'pease porridge in the pot', 'nine days old']).toarray()
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 1]], dtype=int64)

So you pass it either an iterable of your desired features, as above, or a dict with the features as keys and their column indices as values.
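For example, here is a minimal sketch of the dict form (the term-to-index mapping is made up, but it is equivalent to the list above):

from sklearn.feature_extraction.text import CountVectorizer

# dict form: keys are the terms, values are their column indices
cv = CountVectorizer(vocabulary={'hot': 0, 'cold': 1, 'old': 2})
X = cv.fit_transform(['pease porridge hot', 'nine days old'])
print(X.toarray())   # [[1 0 0]
                     #  [0 0 1]]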

If you used CountVectorizer on one set of documents and then you want to use the set of features from those documents for a new set, use the vocabulary_ attribute of your original CountVectorizer and pass it to the new one. So in your example, you could do

newVec = CountVectorizer(vocabulary=vec.vocabulary_)

to create a new vectorizer that uses the vocabulary from your first one.
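For example, a minimal sketch of that round trip (the corpora here are made up, and vec stands in for your original vectorizer):

from sklearn.feature_extraction.text import CountVectorizer

# fit on the original documents to learn a vocabulary
vec = CountVectorizer()
vec.fit(['pease porridge hot', 'pease porridge cold'])

# reuse that vocabulary for a second vectorizer
newVec = CountVectorizer(vocabulary=vec.vocabulary_)
X_new = newVec.fit_transform(['porridge is cold today'])

print(sorted(vec.vocabulary_))   # ['cold', 'hot', 'pease', 'porridge']
print(X_new.toarray())           # [[1 0 0 1]]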


You should call fit_transform or just fit on your original vocabulary source so that the vectorizer learns a vocabulary.

Then you can apply this fitted vectorizer to any new data source via the transform() method.

You can obtain the vocabulary produced by the fit (i.e. the mapping of word to token ID) via vectorizer.vocabulary_ (assuming you named your CountVectorizer vectorizer).
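For example, a minimal sketch of that workflow (the documents and the vectorizer name are made up):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(['the cat sat', 'the dog sat'])        # learns the vocabulary

# apply the same columns to unseen documents
counts = vectorizer.transform(['the cat and the dog'])

print(vectorizer.vocabulary_)   # e.g. {'the': 3, 'cat': 0, 'sat': 2, 'dog': 1}
print(counts.toarray())         # [[1 1 0 2]]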


>>> tags = [
...     "python, tools",
...     "linux, tools, ubuntu",
...     "distributed systems, linux, networking, tools",
... ]
>>> list_of_new_documents = [
...     ["python, chicken"],
...     ["linux, cow, ubuntu"],
...     ["machine learning, bird, fish, pig"],
... ]
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vect = CountVectorizer()
>>> tags = vect.fit_transform(tags)

# vocabulary learned by CountVectorizer (vect)
>>> print(vect.vocabulary_)
{'python': 3, 'tools': 5, 'linux': 1, 'ubuntu': 6, 'distributed': 0, 'systems': 4, 'networking': 2}

# counts for tags
>>> tags.toarray()
array([[0, 0, 0, 1, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 1],
       [1, 1, 1, 0, 1, 1, 0]], dtype=int64)

# to use `transform`, `list_of_new_documents` should be a flat list of strings;
# `itertools.chain.from_iterable` flattens the nested list
>>> from itertools import chain
>>> new_docs = list(chain.from_iterable(list_of_new_documents))
>>> new_docs = vect.transform(new_docs)

# finally, counts for new_docs!
>>> new_docs.toarray()
array([[0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0]])

To verify that CountVectorizer is applying the vocabulary learned from tags to new_docs, print vect.vocabulary_ again, or compare the column layout of new_docs.toarray() with that of tags.toarray().
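If it helps, here is a small sketch of that check against the session above (get_feature_names_out needs scikit-learn >= 1.0; older releases have get_feature_names instead):

# the learned mapping is not modified by transform()
vocab_before = dict(vect.vocabulary_)
_ = vect.transform(["python, tools, something unseen"])
assert vect.vocabulary_ == vocab_before

# column order matches the vocabulary learned from `tags`
print(vect.get_feature_names_out())
# ['distributed' 'linux' 'networking' 'python' 'systems' 'tools' 'ubuntu']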