Understanding the `ngram_range` argument in a CountVectorizer in sklearn

python scikit-learn n-gram feature-selection

Setting the vocabulary explicitly means no vocabulary is learned from data. If you don't set it, you get:

>>> from pprint import pprint
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> v = CountVectorizer(ngram_range=(1, 2))
>>> pprint(v.fit(["an apple a day keeps the doctor away"]).vocabulary_)
{'an': 0,
 'an apple': 1,
 'apple': 2,
 'apple day': 3,
 'away': 4,
 'day': 5,
 'day keeps': 6,
 'doctor': 7,
 'doctor away': 8,
 'keeps': 9,
 'keeps the': 10,
 'the': 11,
 'the doctor': 12}

An explicit vocabulary restricts the terms that will be extracted from text; the vocabulary is not changed:

>>> v = CountVectorizer(ngram_range=(1, 2), vocabulary={"keeps", "keeps the"})
>>> v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
array([[1, 1]])  # unigram and bigram found
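To make the "vocabulary is not changed" point concrete, here is a minimal sketch. The second document and the names `docs` and `fixed_vocab` are made up for illustration, and the vocabulary is passed as a dict (term to column index) so the column order is explicit:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative documents; the second one is an assumption added for contrast.
docs = [
    "an apple a day keeps the doctor away",
    "doctor keeps apple",
]

# Explicit vocabulary as a mapping from term to column index.
fixed_vocab = {"keeps": 0, "keeps the": 1}

v = CountVectorizer(ngram_range=(1, 2), vocabulary=fixed_vocab)
counts = v.fit_transform(docs).toarray()

# Only the two listed terms are counted; every other unigram and bigram is ignored.
print(counts)

# Fitting does not add terms to (or remove terms from) the supplied vocabulary.
print(v.vocabulary_)
```

The second row should have a zero in the "keeps the" column, because that bigram never occurs in "doctor keeps apple", yet the column is still present.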

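(Note that token filtering is applied before n-gram extraction: stop words, when configured, are removed first, and the default `token_pattern` drops single-character tokens such as "a", hence "apple day".)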
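The filter-then-extract order can be checked directly. The sketch below is illustrative only: the variable names are made up, and the custom `token_pattern` and the one-word stop list are assumptions chosen to make the effect visible. It compares the default settings with a pattern that keeps single-character tokens and with an explicit stop word list:

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["an apple a day keeps the doctor away"]

# Default settings: the default token_pattern only matches tokens of two or
# more characters, so "a" is gone before bigrams are built.
default_vocab = CountVectorizer(ngram_range=(1, 2)).fit(doc).vocabulary_

# Assumed alternative pattern that also keeps single-character tokens.
keep_short = CountVectorizer(
    ngram_range=(1, 2), token_pattern=r"(?u)\b\w+\b"
).fit(doc).vocabulary_

# Explicit stop word list: "the" is removed before bigrams are built.
no_the = CountVectorizer(ngram_range=(1, 2), stop_words=["the"]).fit(doc).vocabulary_

print("apple day" in default_vocab, "a day" in default_vocab)
print("apple day" in keep_short, "a day" in keep_short)
print("keeps the" in no_the, "keeps doctor" in no_the)
```

In each case the bigrams are formed from whatever tokens survive filtering, which is why neighbours of a dropped token end up joined.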
