Understanding min_df and max_df in scikit CountVectorizer

Question:

I have five text files that I input to a CountVectorizer. When specifying min_df and max_df to the CountVectorizer instance what does the min/max document frequency exactly mean? Is it the frequency of a word in its particular text file or is it the frequency of the word in the entire overall corpus (five text files)?

What are the differences when min_df and max_df are provided as integers or as floats?

The documentation doesn’t seem to provide a thorough explanation nor does it supply an example to demonstrate the use of these two parameters. Could someone provide an explanation or example demonstrating min_df and max_df?

Asked By: moeabdol


Answers:

As per the CountVectorizer documentation here.

When given as a float in the range [0.0, 1.0], min_df and max_df refer to the document frequency as a proportion, that is, the fraction of documents that contain the term.

When given as an int, they refer to the absolute number of documents that contain the term.

Consider the example where you have 5 text files (or documents). If you set max_df = 0.6 then that would translate to 0.6*5=3 documents. If you set max_df = 2 then that would simply translate to 2 documents.

The source code example below is copied from GitHub here and shows how max_doc_count is constructed from max_df. The code for min_df is similar and can be found on the GitHub page.

max_doc_count = (max_df
                 if isinstance(max_df, numbers.Integral)
                 else max_df * n_doc)
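
As a minimal, standalone sketch of the same conversion (the helper name to_doc_count is made up for illustration and is not part of scikit-learn), the snippet above can be wrapped in a function and tried with the two values from the five-document example:

```python
import numbers


def to_doc_count(df, n_docs):
    """Turn a min_df/max_df style value into a document-count threshold.

    Hypothetical helper that mirrors the excerpt above; not scikit-learn API.
    """
    if isinstance(df, numbers.Integral):
        return df        # an int is taken as an absolute number of documents
    return df * n_docs   # a float is taken as a fraction of the corpus


n_docs = 5
print(to_doc_count(0.6, n_docs))  # 3.0
print(to_doc_count(2, n_docs))    # 2
```

Note that 1 and 1.0 are treated differently: the int 1 means "one document", while the float 1.0 means "100% of the documents".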

The defaults for min_df and max_df are 1 and 1.0, respectively. This basically says “If my term is found in only 1 document, then it’s ignored. Similarly if it’s found in all documents (100% or 1.0) then it’s ignored.”

max_df and min_df are both used internally to calculate max_doc_count and min_doc_count, the maximum and minimum number of documents that a term must be found in. These are then passed to self._limit_features as the keyword arguments high and low, respectively. The docstring for self._limit_features is:

"""Remove too rare or too common features.

Prune features that are non zero in more samples than high or less
documents than low, modifying the vocabulary, and restricting it to
at most the limit most frequent.

This does not prune samples with zero features.
"""
Answered By: Ffisegydd

The defaults for min_df and max_df are 1 and 1.0, respectively. These defaults really don’t do anything at all.

That being said, I believe the currently accepted answer by @Ffisegydd isn’t quite correct.

For example, run this using the defaults, to see that when min_df=1 and max_df=1.0, then

1) all tokens that appear in at least one document are used (i.e., all tokens!)

2) all tokens that appear in all documents are used (we’ll test with one candidate: everywhere).

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df=1, max_df=1.0, lowercase=True)
# A simple list of 3 documents.
corpus = ['one two three everywhere', 'four five six everywhere', 'seven eight nine everywhere']
# Fit on the corpus, transform it, and get the feature names.
X = cv.fit_transform(corpus)
vocab = cv.get_feature_names_out().tolist()  # get_feature_names() in scikit-learn < 1.0
print(vocab)
print(X.toarray())
print(cv.stop_words_)  # terms ignored because of min_df, max_df or max_features

We get:

['eight', 'everywhere', 'five', 'four', 'nine', 'one', 'seven', 'six', 'three', 'two']
[[0 1 0 0 0 1 0 0 1 1]
 [0 1 1 1 0 0 0 1 0 0]
 [1 1 0 0 1 0 1 0 0 0]]
set()

All tokens are kept. There are no stopwords.

Further messing around with the arguments will clarify other configurations.

For fun and insight, I’d also recommend playing around with stop_words = 'english' and seeing that, peculiarly, all the words except ‘seven’ are removed, including ‘everywhere’!

Answered By: Monica Heddneck

max_df is used for removing terms that appear too frequently, also known as “corpus-specific stop words”. For example:

  • max_df = 0.50 means “ignore terms that appear in more than 50% of the documents”.
  • max_df = 25 means “ignore terms that appear in more than 25 documents”.

The default max_df is 1.0, which means “ignore terms that appear in more than 100% of the documents”. Thus, the default setting does not ignore any terms.


min_df is used for removing terms that appear too infrequently. For example:

  • min_df = 0.01 means “ignore terms that appear in fewer than 1% of the documents”.
  • min_df = 5 means “ignore terms that appear in fewer than 5 documents”.

The default min_df is 1, which means “ignore terms that appear in fewer than 1 document”. Thus, the default setting does not ignore any terms.

Answered By: Kevin Markham

I would add one more point to help understand min_df and max_df in the context of tf-idf.

If you go with the default values, meaning all terms are considered, you will certainly generate more tokens, so your clustering process (or whatever else you do with those terms later) will take longer.

BUT the quality of your clustering should NOT be reduced.

One might think that keeping all terms (e.g. very frequent terms or stop words) would lower the quality, but with tf-idf it doesn’t, because the tf-idf weighting inherently gives a low score to terms that appear in many documents, effectively making them uninfluential.
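
To put rough numbers on this, here is a small sketch that assumes the smoothed idf variant ln((1 + n) / (1 + df)) + 1, which is my understanding of the scikit-learn default (the function name smooth_idf is only for illustration). It shows how the idf weight shrinks as a term appears in more of a hypothetical five-document corpus:

```python
import math


def smooth_idf(df, n_docs):
    """Smoothed inverse document frequency: ln((1 + n_docs) / (1 + df)) + 1.

    Assumed formula for illustration; check it against your tf-idf settings.
    """
    return math.log((1 + n_docs) / (1 + df)) + 1


n_docs = 5
for df in (1, 3, 5):
    # df = number of documents containing the term
    print(df, round(smooth_idf(df, n_docs), 3))
# 1 2.099
# 3 1.405
# 5 1.0
```

Under this formula a term that occurs in every document still gets a weight of 1.0, so it is down-weighted relative to rarer terms rather than zeroed out.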

So, to sum up, pruning terms via min_df and max_df improves performance, not the quality of the clusters (as an example).

The crucial point is that if you set min_df and max_df badly, you will lose some important terms and thus lower the quality. So if you are unsure about the right thresholds (they depend on your document set), or if you are confident in your machine’s processing capabilities, leave min_df and max_df unchanged.

Answered By: Amirabbas Askary

The goal of MIN_DF is to ignore words that have too few occurrences to be considered meaningful. For example, your text may contain names of people that appear in only one or two documents. In some applications, this may qualify as noise and could be eliminated from further analysis. Similarly, you can ignore words that are too common with MAX_DF.

Instead of using a minimum/maximum term frequency (total occurrences of a word) to eliminate words, MIN_DF and MAX_DF look at how many documents contain a term, better known as document frequency. The threshold values can be an absolute count (e.g. 1, 2, 3, 4) or a proportion of documents (e.g. 0.25: with MIN_DF, ignore words that appear in fewer than 25% of the documents; with MAX_DF, ignore words that appear in more than 25% of the documents).
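
The difference between term frequency and document frequency is easy to see on a toy corpus. The sketch below uses plain whitespace tokenisation and made-up documents, purely for illustration:

```python
from collections import Counter

# Three tiny, made-up documents.
docs = [
    "spam spam spam spam eggs",
    "eggs bacon",
    "eggs toast",
]

term_freq = Counter()  # total occurrences of each word across the corpus
doc_freq = Counter()   # number of documents that contain each word
for doc in docs:
    tokens = doc.split()
    term_freq.update(tokens)
    doc_freq.update(set(tokens))  # count each word at most once per document

print(term_freq["spam"], doc_freq["spam"])  # 4 1
print(term_freq["eggs"], doc_freq["eggs"])  # 3 3
```

Here "spam" has the highest term frequency but a document frequency of only 1, so a min_df of 2 would drop it even though it occurs four times.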

See some usage examples here.

Answered By: dolly

I just looked at the documentation for sklearn CountVectorizer. This is how I think about it.

Common words have higher frequency values, while rare words have lower frequency values. Expressed as fractions, the frequency values range between 0 and 1.

max_df is the upper ceiling value of the frequency values, while min_df is just the lower cutoff value of the frequency values.

If we want to remove more common words, we set max_df to a lower ceiling value between 0 and 1. If we want to remove more rare words, we set min_df to a higher cutoff value between 0 and 1. We keep everything between max_df and min_df.
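
The "keep everything between min_df and max_df" idea can be sketched in a few lines. This is a simplified illustration with hypothetical document frequencies and a made-up helper name (keep_between); it is not how scikit-learn implements the pruning internally:

```python
def keep_between(doc_freq, n_docs, min_df=0.0, max_df=1.0):
    """Return the terms whose document-frequency fraction lies in [min_df, max_df].

    doc_freq maps each term to the number of documents containing it.
    Illustrative helper only; min_df and max_df are fractions here.
    """
    return sorted(
        term
        for term, df in doc_freq.items()
        if min_df <= df / n_docs <= max_df
    )


# Hypothetical document frequencies over a 5-document corpus.
doc_freq = {"the": 5, "vector": 3, "sparse": 2, "zebra": 1}
n_docs = 5

print(keep_between(doc_freq, n_docs))
# ['sparse', 'the', 'vector', 'zebra']
print(keep_between(doc_freq, n_docs, min_df=0.3, max_df=0.7))
# ['sparse', 'vector']
```

Lowering max_df removes "the" (present in every document), and raising min_df removes "zebra" (present in only one).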

Let me know, not sure if this makes sense.

Answered By: Yi Xiang Chong