What meaning does the length of a Word2vec vector have?

Question:

I am using Word2vec through gensim with Google’s pretrained vectors trained on Google News. I have noticed that the word vectors I can access by doing direct index lookups on the Word2Vec object are not unit vectors:

>>> import numpy
>>> from gensim.models import Word2Vec
>>> w2v = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
>>> king_vector = w2v['king']
>>> numpy.linalg.norm(king_vector)
2.9022589

However, the most_similar method does not use these non-unit vectors; instead, it uses normalised versions from the undocumented .syn0norm property, which contains only unit vectors:

>>> w2v.init_sims()
>>> unit_king_vector = w2v.syn0norm[w2v.vocab['king'].index]
>>> numpy.linalg.norm(unit_king_vector)
0.99999994
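
Since similarity scores are cosine similarities, the scaling drops out: computing the cosine by hand from the raw vectors gives the same score that most_similar reports. A quick sanity check (assuming 'queen' is in the vocabulary):

>>> queen_vector = w2v['queen']
>>> # Cosine similarity is scale-invariant, so the raw vectors give the
>>> # same score as the unit vectors most_similar uses internally;
>>> # this equals w2v.similarity('king', 'queen'):
>>> numpy.dot(king_vector, queen_vector) / (
...     numpy.linalg.norm(king_vector) * numpy.linalg.norm(queen_vector))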

The larger vector is just a scaled-up version of the unit vector:

>>> king_vector - numpy.linalg.norm(king_vector) * unit_king_vector
array([  0.00000000e+00,  -1.86264515e-09,   0.00000000e+00,
         0.00000000e+00,  -1.86264515e-09,   0.00000000e+00,
        -7.45058060e-09,   0.00000000e+00,   3.72529030e-09,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
        ... (some lines omitted) ...
        -1.86264515e-09,  -3.72529030e-09,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00], dtype=float32)
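
A more compact way to verify this, given that the residuals above all fall within numpy's default tolerances:

>>> numpy.allclose(king_vector,
...                numpy.linalg.norm(king_vector) * unit_king_vector)
True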

Given that word similarity comparisons in Word2Vec are done by cosine similarity, it’s not obvious to me what the lengths of the non-normalised vectors mean – although I assume they mean something, since gensim exposes them to me rather than only exposing the unit vectors in .syn0norm.

How are the lengths of these non-normalised Word2vec vectors generated, and what is their meaning? For what calculations does it make sense to use the normalised vectors, and when should I use the non-normalised ones?

Asked By: Mark Amery


Answers:

I think the answer you are looking for is described in the 2015 paper Measuring Word Significance using Distributed Representations of Words by Adriaan Schakel and Benjamin Wilson (arXiv:1508.02297). The key points:

When a word appears in different contexts, its vector gets moved in different directions during updates. The final vector then represents some sort of weighted average over the various contexts. Averaging over vectors that point in different directions typically results in a vector that gets shorter with increasing number of different contexts in which the word appears. For words to be used in many different contexts, they must carry little meaning. Prime examples of such insignificant words are high-frequency stop words, which are indeed represented by short vectors despite their high term frequencies …
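
You can see this spread in the loaded model directly. A minimal sketch, using the raw (pre-normalisation) weight matrix that gensim exposes as .syn0 alongside .syn0norm:

>>> # Lengths of all raw word vectors; the wide spread reflects the
>>> # context-averaging effect the paper describes:
>>> norms = numpy.linalg.norm(w2v.syn0, axis=1)
>>> norms.min(), norms.mean(), norms.max()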


For given term frequency, the vector length is seen to take values only in a narrow interval. That interval initially shifts upwards with increasing frequency. Around a frequency of about 30, that trend reverses and the interval shifts downwards.

Both forces determining the length of a word vector are seen at work here. Small-frequency words tend to be used consistently, so that the more frequently such words appear, the longer their vectors. This tendency is reflected by the upwards trend in Fig. 3 at low frequencies. High-frequency words, on the other hand, tend to be used in many different contexts, the more so, the more frequently they occur. The averaging over an increasing number of different contexts shortens the vectors representing such words. This tendency is clearly reflected by the downwards trend in Fig. 3 at high frequencies, culminating in punctuation marks and stop words with short vectors at the very end.

[Graph showing the trend described in the previous excerpt]

Figure 3: Word vector length v versus term frequency tf of all words in the hep-th vocabulary. Note the logarithmic scale used on the frequency axis. The dark symbols denote bin means, with the kth bin containing the frequencies in the interval [2^(k−1), 2^k − 1] for k = 1, 2, 3, .... These means are included as a guide to the eye. The horizontal line indicates the length v = 1.37 of the mean vector.
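
If you want to reproduce a plot like Figure 3 for a model of your own, here is a rough sketch. It assumes a model you trained yourself in gensim (so that the stored vocab counts are real term frequencies; the pretrained GoogleNews file does not ship true counts) and that matplotlib is installed:

>>> import matplotlib.pyplot as plt
>>> words = list(model.vocab)                      # model: your own trained Word2Vec
>>> freqs = [model.vocab[w].count for w in words]  # true term frequencies
>>> lengths = [numpy.linalg.norm(model[w]) for w in words]  # raw vector lengths
>>> plt.scatter(freqs, lengths, s=1)
>>> plt.xscale('log')                              # Figure 3 uses a log frequency axis
>>> plt.xlabel('term frequency tf')
>>> plt.ylabel('vector length v')
>>> plt.show()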


4 Discussion

Most applications of distributed representations of words obtained through word2vec so far centered around semantics. A host of experiments have demonstrated the extent to which the direction of word vectors captures semantics. In this brief report, it was pointed out that not only the direction, but also the length of word vectors carries important information. Specifically, it was shown that word vector length furnishes, in combination with term frequency, a useful measure of word significance.
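
As a crude application of that conclusion, you could rank words by raw vector length as a significance score. This is a sketch only; the paper's actual measure controls for term frequency, which this does not:

>>> norms = numpy.linalg.norm(w2v.syn0, axis=1)
>>> order = norms.argsort()[::-1]
>>> # Words with the longest raw vectors, i.e. the most "significant"
>>> # under this crude proxy:
>>> [w2v.index2word[i] for i in order[:10]]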

Answered By: user3592579