Understanding min_df and max_df in scikit CountVectorizer

python machine-learning scikit-learn nlp

max_df is used for removing terms that appear too frequently, also known as "corpus-specific stop words". For example:

max_df = 0.50 means "ignore terms that appear in more than 50% of the documents".
max_df = 25 means "ignore terms that appear in more than 25 documents".

The default max_df is 1.0, which means "ignore terms that appear in more than 100% of the documents". Thus, the default setting does not ignore any terms.
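As a minimal sketch of the max_df cutoff, the snippet below fits CountVectorizer on a made-up four-document corpus (the documents are illustrative and not from the original question). "the" occurs in all four documents, so max_df=0.5 drops it. "cat" occurs in exactly half of them, which is not "more than 50%", so it is kept.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (an assumption for illustration): "the" is in 4/4 documents, "cat" in 2/4.
docs = ["the cat sat", "the dog barked", "the bird sang", "the cat slept"]

# vocabulary_ maps each retained term to its column index; sorting it lists the terms.
default_vocab = sorted(CountVectorizer().fit(docs).vocabulary_)
pruned_vocab = sorted(CountVectorizer(max_df=0.5).fit(docs).vocabulary_)

print(default_vocab)  # every term, including "the"
print(pruned_vocab)   # "the" is gone; "cat" (exactly 50%) survives
```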

min_df is used for removing terms that appear too infrequently. For example:

min_df = 0.01 means "ignore terms that appear in less than 1% of the documents".
min_df = 5 means "ignore terms that appear in fewer than 5 documents".

The default min_df is 1, which means "ignore terms that appear in fewer than 1 document". Thus, the default setting does not ignore any terms.
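The min_df side can be sketched the same way. The corpus here is a made-up example: only "the" (4 documents) and "cat" (2 documents) reach the threshold of min_df=2, and every term that occurs in a single document is pruned.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (an assumption for illustration).
docs = ["the cat sat", "the dog barked", "the bird sang", "the cat slept"]

# min_df=2 keeps only terms found in at least 2 documents.
frequent_vocab = sorted(CountVectorizer(min_df=2).fit(docs).vocabulary_)

print(frequent_vocab)  # ['cat', 'the']
```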

As per the CountVectorizer documentation:

When using a float in the range [0.0, 1.0], min_df and max_df refer to the document frequency, that is, the proportion of documents that contain the term.

When using an int, they refer to the absolute number of documents that contain the term.

Consider the example where you have 5 text files (or documents). If you set max_df = 0.6 then that would translate to 0.6*5=3 documents. If you set max_df = 2 then that would simply translate to 2 documents.
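The five-document case above can be checked with a short sketch. The corpus and the helper name fitted_vocab are assumptions made up for illustration. With max_df=0.6 the threshold is 3 documents, so it should behave exactly like max_df=3.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus with document frequencies:
# apple=4, banana=3, cherry=3, date=2, elderberry=1
docs = [
    "apple banana",
    "apple banana cherry",
    "apple banana cherry",
    "apple cherry date",
    "date elderberry",
]

def fitted_vocab(max_df):
    """Return the sorted terms that survive the given max_df."""
    return sorted(CountVectorizer(max_df=max_df).fit(docs).vocabulary_)

print(fitted_vocab(0.6))  # threshold 0.6 * 5 = 3 documents: "apple" is dropped
print(fitted_vocab(3))    # same threshold expressed as an int
print(fitted_vocab(2))    # only terms found in at most 2 documents remain
```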

The snippet below is copied from the scikit-learn source on GitHub and shows how max_doc_count is derived from max_df. The code for min_df is analogous and can be found in the same place.

    max_doc_count = (max_df
                     if isinstance(max_df, numbers.Integral)
                     else max_df * n_doc)

The defaults for min_df and max_df are 1 and 1.0, respectively. This basically says "ignore a term only if it is found in fewer than 1 document or in more than 100% of the documents". Neither condition can ever hold for a term that occurs in the corpus, so the defaults do not prune anything.

max_df and min_df are both used internally to calculate max_doc_count and min_doc_count, the maximum and minimum number of documents in which a term may appear. These are then passed to self._limit_features as the keyword arguments high and low, respectively. The docstring for self._limit_features is:

    """Remove too rare or too common features.

    Prune features that are non zero in more samples than high or less
    documents than low, modifying the vocabulary, and restricting it to
    at most the limit most frequent.

    This does not prune samples with zero features.
    """
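To make the threshold logic concrete, here is a pure-Python sketch of the same idea. It is not scikit-learn's implementation: resolve_doc_count and prune_terms are hypothetical helpers, and the input is assumed to be already tokenized.

```python
import numbers
from collections import Counter

def resolve_doc_count(df, n_doc):
    """Turn a min_df/max_df value into a document count (ints pass through)."""
    return df if isinstance(df, numbers.Integral) else df * n_doc

def prune_terms(tokenized_docs, min_df=1, max_df=1.0):
    """Return the sorted terms whose document frequency lies in [low, high]."""
    n_doc = len(tokenized_docs)
    low = resolve_doc_count(min_df, n_doc)
    high = resolve_doc_count(max_df, n_doc)
    # Count each term once per document, however often it repeats inside it.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(term for term, count in doc_freq.items() if low <= count <= high)

# Document frequencies: a=4, b=2, c=1, d=1
tokenized = [["a", "b"], ["a", "c"], ["a", "b", "d"], ["a"]]

print(prune_terms(tokenized))              # defaults keep everything
print(prune_terms(tokenized, min_df=2))    # drops the terms seen in one document
print(prune_terms(tokenized, max_df=0.5))  # drops "a", seen in more than 2 documents
```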

I would add one more point to help understand min_df and max_df in the context of tf-idf.

If you go with the default values, meaning that all terms are considered, you will generate more tokens. So your clustering process (or whatever else you do with those terms later) will take longer.

BUT the quality of your clustering should NOT be reduced.

One might think that allowing all terms (e.g. very frequent terms or stop words) to be present would lower the quality, but with tf-idf it doesn't. The tf-idf weighting inherently gives a low score to those terms because they appear in many documents, which leaves them with little influence.
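A small sketch of this down-weighting, using TfidfVectorizer with its default settings on a made-up corpus. Note that with scikit-learn's default smoothing, a term found in every document gets the lowest possible idf (1.0), not zero, so its weight is reduced and not eliminated.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (an assumption for illustration): "the" occurs in every document.
docs = ["the cat sat", "the dog barked", "the bird sang"]

vectorizer = TfidfVectorizer().fit(docs)

# Map each term to its learned inverse document frequency.
idf = {term: float(vectorizer.idf_[index])
       for term, index in vectorizer.vocabulary_.items()}

print(idf["the"])  # lowest idf in the vocabulary
print(idf["cat"])  # higher idf, since "cat" occurs in only one document
```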

So to sum it up, pruning terms via min_df and max_df improves performance, not the quality of the clusters (as an example).

The crucial point is that if you set min_df and max_df badly, you will lose some important terms and thus lower the quality. So if you are unsure about the right thresholds (they depend on your document set), or if you are confident in your machine's processing capabilities, leave min_df and max_df unchanged.
