Is smooth_idf redundant?

Question:

The scikit-learn documentation says:

If smooth_idf=True (the default), the constant “1” is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(d, t) = log [ (1 + n) / (1 + df(d, t)) ] + 1.

However, why would df(d, t) = 0? If a term doesn’t occur in any text, the dictionary wouldn’t have the term in the first place, would it?
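
For reference, evaluating both formulas by hand in plain Python (this is just the arithmetic, not scikit-learn itself), with n = 2 documents and a term whose df is 0:

import math

n = 2   # number of documents in the collection
df = 0  # document frequency of a term that occurs in no document

# smooth_idf=True: log((1 + n) / (1 + df)) + 1 is defined even for df == 0
print(math.log((1 + n) / (1 + df)) + 1)  # ~2.0986

# smooth_idf=False: log(n / df) + 1 divides by zero when df == 0
try:
    print(math.log(n / df) + 1)
except ZeroDivisionError:
    print('undefined for df == 0')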

Asked By: yhylord


Answers:

This feature is useful in TfidfVectorizer. According to the documentation, this class can be given a predefined vocabulary. If a word from the vocabulary was never seen in the training data but occurs in the test data, smooth_idf allows it to be processed without error: without smoothing, that word's document frequency is zero and its idf is undefined.

from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ['apple mango', 'mango banana']
test_texts = ['apple banana', 'mango orange']
vocab = ['apple', 'mango', 'banana', 'orange']  # 'orange' never occurs in train_texts

vectorizer1 = TfidfVectorizer(smooth_idf=True, vocabulary=vocab).fit(train_texts)
vectorizer2 = TfidfVectorizer(smooth_idf=False, vocabulary=vocab).fit(train_texts)
print(vectorizer1.transform(test_texts).todense())  # works okay
print(vectorizer2.transform(test_texts).todense())  # raises a ValueError

Output:

[[ 0.70710678  0.          0.70710678  0.        ]
 [ 0.          0.43016528  0.          0.90275015]]
...
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
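
The root cause is visible in the fitted idf values. Continuing from the snippet above (idf_ is a public attribute of the fitted vectorizer; the exact non-finite value in the unsmoothed case may depend on the scikit-learn/NumPy version):

print(dict(zip(vocab, vectorizer1.idf_)))  # smooth_idf=True: every idf is finite
print(dict(zip(vocab, vectorizer2.idf_)))  # smooth_idf=False: idf of 'orange' is non-finite

With smooth_idf=False, the idf of 'orange' would be log(2/0) + 1, which is not defined, so transform produces the non-finite values that trigger the ValueError.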
Answered By: David Dale

Thanks for the answer, David Dale.

However, when I run it without specifying a vocabulary, it seems not to matter whether smooth_idf is set to False.

Is there a situation where we would want to specify a vocabulary?

from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ['apple mango', 'mango banana']
test_texts = ['apple banana', 'mango orange']

# same setup as above, but without vocabulary=vocab
vectorizer1 = TfidfVectorizer(smooth_idf=True).fit(train_texts)
vectorizer2 = TfidfVectorizer(smooth_idf=False).fit(train_texts)
print(vectorizer1.transform(test_texts).todense())  # works okay
print(vectorizer2.transform(test_texts).todense())  # also works okay now

Output:

[[0.70710678 0.70710678 0.        ]
 [0.         0.         1.        ]]
[[0.70710678 0.70710678 0.        ]
 [0.         0.         1.        ]]
Answered By: Sam