Why are the TF-IDF values different from idf_?

Question:

Why do the values in the vectorized corpus differ from the values in the idf_ attribute? Shouldn't idf_ simply return the inverse document frequency (IDF) values as they appear in the vectorized corpus?

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer()
corpus = vectorizer.fit_transform(corpus)

print(corpus)

Vectorized corpus:

  (0, 2)    0.6300993445179441
  (0, 4)    0.44832087319911734
  (0, 0)    0.44832087319911734
  (0, 3)    0.44832087319911734
  (1, 1)    0.6300993445179441
  (1, 4)    0.44832087319911734
  (1, 0)    0.44832087319911734
  (1, 3)    0.44832087319911734

Vocabulary and idf_ values:

print(dict(zip(vectorizer.vocabulary_, vectorizer.idf_)))

Output:

{'this': 1.0, 
 'is': 1.4054651081081644, 
 'very': 1.4054651081081644, 
 'strange': 1.0, 
 'nice': 1.0}

Vocabulary index:

print(vectorizer.vocabulary_)

Output:

{'this': 3, 
 'is': 0, 
 'very': 4, 
 'strange': 2, 
 'nice': 1}

Why is the value for the word "this" 0.44 in the vectorized corpus but 1.0 in idf_?

Asked By: dasilvadaniel

||

Answers:

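This is because of L2 normalization, which TfidfVectorizer() applies by default (norm='l2'): each row of tf * idf values is divided by its Euclidean length.
If you set the norm parameter to None, you get the raw tf * idf values, which match idf_ here because every term occurs exactly once in each document.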


>>> vectorizer = TfidfVectorizer(norm=None)
>>> print(vectorizer.fit_transform(["This is very strange", "This is very nice"]))

# output

  (0, 2)    1.4054651081081644
  (0, 4)    1.0
  (0, 0)    1.0
  (0, 3)    1.0
  (1, 1)    1.4054651081081644
  (1, 4)    1.0
  (1, 0)    1.0
  (1, 3)    1.0
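
To see where 0.448 and 0.630 come from, the normalization can be reproduced by hand. This is a minimal sketch in plain Python, assuming scikit-learn's default smoothed formula idf = ln((1 + n) / (1 + df)) + 1 and the default L2 norm. The helper names (smooth_idf, tfidf_row) are made up for illustration and are not part of scikit-learn.

```python
import math

# Tokenized, lowercased versions of the two documents from the question.
docs = [["this", "is", "very", "strange"],
        ["this", "is", "very", "nice"]]
n_docs = len(docs)
vocab = sorted({word for doc in docs for word in doc})

def smooth_idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1 (assumed default formula)."""
    df = sum(term in doc for doc in docs)  # number of documents containing term
    return math.log((1 + n_docs) / (1 + df)) + 1

idf = {term: smooth_idf(term) for term in vocab}

def tfidf_row(doc):
    """Raw tf * idf for one document, then divided by the row's L2 norm."""
    raw = {term: doc.count(term) * idf[term] for term in set(doc)}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

row0 = tfidf_row(docs[0])
print("idf:       ", idf["this"], idf["strange"])
print("normalized:", row0["this"], row0["strange"])
```

For the first document the raw row is three values of 1.0 and one of about 1.405, so its length is about 2.23. Dividing by that gives roughly 0.448 for "this" and 0.630 for "strange", which agrees with the vectorized corpus in the question.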

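Also, your way of pairing each feature with its idf value is wrong: vocabulary_ is a dict that maps each term to its column index, and iterating over it does not yield the terms in column order, whereas idf_ is ordered by column index.

You could use the following instead (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement):

>>> print(dict(zip(vectorizer.get_feature_names_out().tolist(), vectorizer.idf_.tolist())))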
      
     {'is': 1.0,
      'nice': 1.4054651081081644, 
      'strange': 1.4054651081081644, 
      'this': 1.0, 
      'very': 1.0}
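
The ordering mismatch can also be seen without scikit-learn. The sketch below copies the vocabulary_ mapping and the idf values from the outputs above into plain Python objects (the variable names are illustrative) and compares the naive zip with a lookup by column index.

```python
# Term -> column index, as printed by vectorizer.vocabulary_ in the question.
vocabulary = {'this': 3, 'is': 0, 'very': 4, 'strange': 2, 'nice': 1}

# IDF values in column order (columns 0..4 = is, nice, strange, this, very).
idf_values = [1.0, 1.4054651081081644, 1.4054651081081644, 1.0, 1.0]

# Naive pairing: walks the dict in insertion order, not in column order.
mispaired = dict(zip(vocabulary, idf_values))

# Correct pairing: use each term's column index to look up its IDF.
paired = {term: idf_values[col] for term, col in vocabulary.items()}

print(mispaired)
print(paired)
```

The first dict reproduces the misleading output from the question (for example, 'is' paired with 1.405), while the second assigns 1.0 to the three words shared by both documents and 1.405 to 'strange' and 'nice'.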
Answered By: Venkatachalam