Interpreting negative Word2Vec similarity from gensim

Question:

E.g. we train a word2vec model using gensim:

from gensim import corpora, models, similarities
from gensim.models.word2vec import Word2Vec

documents = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]

texts = [[word for word in document.lower().split()] for document in documents]
w2v_model = Word2Vec(texts, size=500, window=5, min_count=1)

And when we query the similarity between words, we find negative similarity scores:

>>> w2v_model.similarity('graph', 'computer')
0.046929569156789336
>>> w2v_model.similarity('graph', 'system')
0.063683518562347399
>>> w2v_model.similarity('survey', 'generation')
-0.040026775040430063
>>> w2v_model.similarity('graph', 'trees')
-0.0072684112978664561

How do we interpret the negative scores?

If it’s a cosine similarity, shouldn’t the range be [0,1]?

What is the upper bound and lower bound of the Word2Vec.similarity(x,y) function? There isn’t much written in the docs: https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.similarity =(

Looking at the Python wrapper code, there isn’t much there either: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py#L1165

(If possible, please do point me to the .pyx code of where the similarity function is implemented.)

Asked By: alvas


Answers:

Cosine similarity ranges from -1 to 1, the same range as the cosine function itself.

[image: plot of a cosine wave, oscillating between -1 and 1]
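
For a concrete illustration, a minimal NumPy sketch of the three extremes:

import numpy as np

def cosine_similarity(a, b):
    # dot product of the two vectors divided by the product of their norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v = np.array([1.0, 2.0, 3.0])

print(cosine_similarity(v, v))                            #  1.0: same direction
print(cosine_similarity(v, np.array([-2.0, 1.0, 0.0])))   #  0.0: orthogonal
print(cosine_similarity(v, -v))                           # -1.0: opposite direction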

As for the source:

https://github.com/RaRe-Technologies/gensim/blob/ba1ce894a5192fc493a865c535202695bb3c0424/gensim/models/word2vec.py#L1511

def similarity(self, w1, w2):
    """
    Compute cosine similarity between two words.
    Example::
      >>> trained_model.similarity('woman', 'man')
      0.73723527
      >>> trained_model.similarity('woman', 'woman')
      1.0
    """
    return dot(matutils.unitvec(self[w1]), matutils.unitvec(self[w2]))
Answered By: Eugene K

As others have said, the cosine similarity can range from -1 to 1 based on the angle between the two vectors being compared. The exact implementation in gensim is a simple dot product of the normalized vectors.

https://github.com/RaRe-Technologies/gensim/blob/4f0e2ae0531d67cee8d3e06636e82298cb554b04/gensim/models/keyedvectors.py#L581

def similarity(self, w1, w2):
    """
    Compute cosine similarity between two words.
    Example::
      >>> trained_model.similarity('woman', 'man')
      0.73723527
      >>> trained_model.similarity('woman', 'woman')
      1.0
    """
    return dot(matutils.unitvec(self[w1]), matutils.unitvec(self[w2]))
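
As a quick sanity check, here is a minimal sketch that recomputes the similarity by hand; it assumes the gensim 4.x API, where the training argument is vector_size (formerly size) and the vectors live under model.wv:

import numpy as np
from gensim.models import Word2Vec

# tiny toy corpus, same shape as the question's `texts`
sentences = [
    "human machine interface for lab abc computer applications".split(),
    "a survey of user opinion of computer system response time".split(),
    "graph minors iv widths of trees and well quasi ordering".split(),
    "graph minors a survey".split(),
]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, seed=42)

def unitvec(v):
    return v / np.linalg.norm(v)

w1, w2 = "graph", "survey"
manual = float(np.dot(unitvec(model.wv[w1]), unitvec(model.wv[w2])))
print(model.wv.similarity(w1, w2), manual)  # the two numbers agree; either may be negative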

In terms of interpretation, you can think of these values much like correlation coefficients: a value of 1 means the word vectors point in exactly the same direction (e.g., “woman” compared with “woman”), a value of 0 means the vectors are orthogonal (no detectable relationship), and a value of -1 means they point in exactly opposite directions.

Answered By: Donovan McMurray

The other respondents answered your question about the range of values in Word2Vec (W2V), but they didn’t address the underlying mechanism by which W2V produces negative values. The key to understanding why the cosine similarity of two W2V vectors can be negative is to appreciate that W2V vectors are not the same as vectors based on simple counting. If the vectorization scheme were based on a simple count of how many times each word appears within a window of n words in the training corpus, then every vector component would be non-negative, and all cosine similarities across the vocabulary would fall in the [0,1] range you asked about. Here’s a concrete example of such a simple counting scheme:

Starting corpus:
“The restaurant was excellent and the waiters were experts. I recommend eating there if you want to experience the best of fantastic fine dining.”

The vocabulary then would be:

{‘and’, ‘best’, ‘dining’, ‘eating’, ‘excellent’, ‘experience’, ‘experts’, ‘fantastic’, ‘fine’, ‘i’, ‘if’, ‘of’, ‘recommend’, ‘restaurant’, ‘the’, ‘there’, ‘to’, ‘waiters’, ‘want’, ‘was’, ‘were’, ‘you’}

And the vectors for “the”, “restaurant”, and “eating” are below (the vectors are the columns):
[image: table of count vectors, with “the”, “restaurant”, and “eating” as columns]

The scalars within each vector range from 0 to m (where m is the highest count of any word in the corpus), so no component is ever negative. When comparing two vectors under this system, the cosine similarity therefore always falls between 0 (orthogonal) and 1 (aligned). For example, cosine similarities for a few word pairs from this vocabulary are:

"the" and "were": 0.1581
"the" and "experts": 0.3162
"were" and "experts": 0.5

But this is not how W2V works. Instead, it is a shallow neural network with a single hidden layer and an output layer, trained either to predict a missing word from its surrounding context (CBOW) or to predict the surrounding context words from a given word (skip-gram). The crafty part of W2V is that once training is done we don’t care about that output at all; what we want is the embedding, i.e. the hidden-layer weights for the word of interest (if you are unfamiliar with the inner workings of W2V, I recommend reading Chris McCormick’s blog on it [1]). The net result is that the learned values in an embedding vector can be negative as well as positive. As such, it is possible for one vector to point in a direction diametrically opposed to another, giving a cosine similarity of -1, something that is not possible with the simple counting approach.
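
A minimal sketch of that last point, again assuming the gensim 4.x API: the word’s embedding is a row of the learned weight matrix exposed as model.wv.vectors, and those weights carry both signs:

import numpy as np
from gensim.models import Word2Vec

sentences = [
    "system and human system engineering testing of eps".split(),
    "the intersection graph of paths in trees".split(),
    "graph minors iv widths of trees and well quasi ordering".split(),
    "graph minors a survey".split(),
]
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, seed=1)

row = model.wv.key_to_index["graph"]
print(np.allclose(model.wv.vectors[row], model.wv["graph"]))  # True: the embedding is one weight row
print(bool((model.wv.vectors < 0).any()))                     # True: weights are not restricted to >= 0
print(model.wv.similarity("graph", "eps"))                    # can therefore come out negative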

My interpretation of a negative cosine similarity between two words is that they are either unrelated or perhaps of opposing connotation, but not necessarily antonyms of each other. In fact, I can think of examples where true antonyms would have high cosine similarity because, as John Rupert Firth [2] might have put it, they keep the same company.

[1]: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
[2]: https://en.wikipedia.org/wiki/John_Rupert_Firth

Answered By: Robb Dunlap