Why is my scikit-learn HashingVectorizer giving me floats with binary=True set?

Question:

I’m trying to use scikit-learn’s Bernoulli Naive Bayes classifier. I had the classifier working fine on a small data set using CountVectorizer, but ran into trouble when I tried to use HashingVectorizer to work with a larger data set. Holding all other parameters (training documents, test documents, classifier and feature extractor settings) constant and just switching from CountVectorizer to HashingVectorizer caused my classifier to always spit out the same label for all documents.
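A stripped-down sketch of the pipeline I’m describing (with toy documents standing in for my real, much larger corpus):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import BernoulliNB

# toy stand-ins for my real training and test sets
train_docs = ["whale ship sea voyage", "stock market price trading",
              "harpoon whale ocean ship", "market shares stock price"]
train_labels = [0, 1, 0, 1]
test_docs = ["whale ocean voyage", "price of stock shares"]

vectorizer = HashingVectorizer(binary=True, decode_error='ignore')
clf = BernoulliNB()

clf.fit(vectorizer.transform(train_docs), train_labels)
print(clf.predict(vectorizer.transform(test_docs)))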

I wrote the following script to investigate what would be different between the two feature extractors:

from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer

cv = CountVectorizer(binary=True, decode_error='ignore')
h = HashingVectorizer(binary=True, decode_error='ignore')

with open('moby_dick.txt') as fp:
    doc = fp.read()

cv_result = cv.fit_transform([doc])
h_result = h.transform([doc])

print(cv_result)
print(repr(cv_result))
print(h_result)
print(repr(h_result))

(where ‘moby_dick.txt’ is the Project Gutenberg copy of Moby Dick)

The (condensed) results:

  (0, 17319)    1
  (0, 17320)    1
  (0, 17321)    1
<1x17322 sparse matrix of type '<type 'numpy.int64'>'
    with 17322 stored elements in Compressed Sparse Column format>

  (0, 1048456)  0.00763203138591
  (0, 1048503)  0.00763203138591
  (0, 1048519)  0.00763203138591
<1x1048576 sparse matrix of type '<type 'numpy.float64'>'
    with 17168 stored elements in Compressed Sparse Row format>

As you can see, CountVectorizer, in binary mode, returns the integer 1 as the value of every feature (we only expect to see 1, since there’s only one document); HashingVectorizer, on the other hand, returns floats (all the same within a document, but different documents produce a different value). I suspect my issues stem from passing these floats on to BernoulliNB.

Ideally, I would like a way to get the same binary-format data from HashingVectorizer as I get from CountVectorizer; failing that, I could use BernoulliNB’s binarize parameter if I knew a sane threshold to set for this data, but I am not clear on what those floats represent (they’re clearly not token counts, as they’re all the same and less than 1).
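For example, I know I can construct the classifier like this, but the threshold below is made up and I have no idea what value would actually make sense for this data:

from sklearn.naive_bayes import BernoulliNB

# 0.005 is an arbitrary placeholder threshold, not a value I trust
clf = BernoulliNB(binarize=0.005)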

Any help would be appreciated.

Asked By: Mark Tozzi


Answers:

Under the default settings, HashingVectorizer normalizes your feature vectors to unit Euclidean length:

>>> import numpy as np
>>> import scipy.linalg
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> text = "foo bar baz quux bla"
>>> X = HashingVectorizer(n_features=8).transform([text])
>>> X.toarray()
array([[-0.57735027,  0.        ,  0.        ,  0.        ,  0.57735027,
         0.        , -0.57735027,  0.        ]])
>>> scipy.linalg.norm(np.abs(X.toarray()))
1.0

Setting binary=True only postpones this normalization until after binarizing the features, i.e. setting all the non-zero values to one. To turn normalization off, you also have to set norm=None:

>>> X = HashingVectorizer(n_features=8, binary=True).transform([text])
>>> X.toarray()
array([[ 0.5,  0. ,  0. ,  0. ,  0.5,  0.5,  0.5,  0. ]])
>>> scipy.linalg.norm(X.toarray())
1.0
>>> X = HashingVectorizer(n_features=8, binary=True, norm=None).transform([text])
>>> X.toarray()
array([[ 1.,  0.,  0.,  0.,  1.,  1.,  1.,  0.]])

This is also why it’s returning float arrays: normalization requires them. While the vectorizer could be rigged to return another dtype, that would require conversion inside the transform method and probably one back to float in the next estimator.
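If you do need integers downstream, one workaround is simply to cast the resulting sparse matrix yourself (an explicit conversion after the transform, not a vectorizer option), e.g.:

>>> X = HashingVectorizer(n_features=8, binary=True, norm=None).transform([text])
>>> X.astype(np.int64).toarray()
array([[1, 0, 0, 0, 1, 1, 1, 0]])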

Answered By: Fred Foo

To replace CountVectorizer(binary=True) with HashingVectorizer, the proper parameters are: norm=None (default "l2"), alternate_sign=False (default True), and binary=True (default False).

However, if you require output with the same dtype as CountVectorizer, you can specify dtype="int64" (default "float64").

Furthermore, dtype="uint8" is the optimal dtype when binary=True and will save you a lot of memory:

>>> from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer
>>> 
>>> cv = CountVectorizer(binary=True)
>>> hv = HashingVectorizer(norm=None, alternate_sign=False, binary=True, dtype='uint8')
>>> 
>>> doc = "one two three two one"
>>> cv_result = cv.fit_transform([doc])
>>> hv_result = hv.transform([doc])
>>> 
>>> print(repr(cv_result))
<1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>
>>> print(cv_result)
  (0, 0)    1
  (0, 2)    1
  (0, 1)    1
>>> print(f'used: {(cv_result.data.nbytes + cv_result.indptr.nbytes + cv_result.indices.nbytes)} bytes\n')
used: 44 bytes

>>> 
>>> print(repr(hv_result))
<1x1048576 sparse matrix of type '<class 'numpy.uint8'>'
    with 3 stored elements in Compressed Sparse Row format>
>>> print(hv_result)
  (0, 824960)   1
  (0, 884299)   1
  (0, 948532)   1
>>> print(f'used: {(hv_result.data.nbytes + hv_result.indptr.nbytes + hv_result.indices.nbytes)} bytes')
used: 23 bytes

Answered By: Maurício Collaça