Shape of my DataFrame (#rows) and that of the final embeddings array don't match

Question:

I generated word embeddings for my corpus (a 2-D list), then tried to compute the average Word2Vec embedding for each individual word list inside it (that is, for each comment, which has been converted into a list through the split() method). But the final length of my average-Word2Vec-embeddings NumPy array doesn't match the number of rows, i.e. 159571, which is the number of comments.

Here's the code for generating the 'final_embeddings' array:

# Building the vocabulary
vocabulary = set(model.wv.index_to_key)

final_embeddings = []
for i in flatten_corpus:
    avg_embeddings = None
    for j in i:
        if j in vocabulary:
            if avg_embeddings is None:
                avg_embeddings = model.wv[j]
            else:
                avg_embeddings = avg_embeddings + model.wv[j]
    if avg_embeddings is not None:
        avg_embeddings = avg_embeddings / len(avg_embeddings)
        final_embeddings.append(avg_embeddings)

  • length of flatten_corpus: 159571
  • length of the above array: 159487 (doesn't match the number above)

What am I doing wrong?

Asked By: YuvrajSingh


Answers:

You are appending to your final_embeddings only in a code branch that's not always reached: it runs only if there's at least one known word in the text.

If an element of flatten_corpus includes only words that aren't in the model, the loop simply proceeds to the next item in flatten_corpus without appending anything.

As a result, you're not only missing those 84 items; the average vectors in final_embeddings also no longer sit at the same indexes as their matching texts.

A quick-and-dirty fix would be to initialize your avg_embeddings to some value that serves as a stand-in default, even if none of the words are known. For example:

    import numpy as np

    avg_embeddings = np.zeros(model.vector_size, dtype=np.float32)

Of course, having 84 of your per-text average vectors be zero vectors may cause other problems down the line, so you may want to think more about what, if anything, you should do for such texts. Maybe, without word-vectors to model them, they should just be dropped, along with their matching rows, as in the sketch below.
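For illustration, here's a minimal sketch of that drop-them approach, assuming (hypothetically) that your comments live in a pandas DataFrame df whose rows align one-to-one with flatten_corpus:

import numpy as np

# Keep only the indexes of texts with at least one word known to the model
kept_indexes = [
    index for index, text in enumerate(flatten_corpus)
    if any(word in model.wv for word in text)
]

# Average the word-vectors of each kept text; every kept text has at least
# one known word, so the inner list is never empty
final_embeddings = np.array([
    np.mean([model.wv[word] for word in flatten_corpus[index] if word in model.wv], axis=0)
    for index in kept_indexes
])

# Drop the same rows from the DataFrame so rows and vectors stay aligned
df = df.iloc[kept_indexes].reset_index(drop=True)
assert len(df) == len(final_embeddings)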

Other notes on making code that is easier to debug:

  • using descriptive temporary variable names like ‘text’ & ‘word’ instead of ‘i’ & ‘j’ makes code clearer

  • you can already test whether a word is inside a set of word-vectors (model.wv, of Gensim class KeyedVectors) with idiomatic Python membership-checking, so there's no need to create your vocabulary set; instead, just check with if word in model.wv:

  • the KeyedVectors object has a utility method, .get_mean_vector(), for getting the average of the word-vectors of a list of words, with other options that could prove helpful. If you combine it with a Python list comprehension, your code can be replaced by a one-liner:

final_embeddings = [model.wv.get_mean_vector(text) for text in flatten_corpus]
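And if you'd rather keep an explicit loop, here's a minimal sketch applying the notes above (descriptive names, membership checks against model.wv, and the zero-vector stand-in so every text keeps its slot); the stand-in is just one choice, per the caveats above:

import numpy as np

final_embeddings = []
for text in flatten_corpus:
    # Collect vectors only for the words the model knows
    word_vectors = [model.wv[word] for word in text if word in model.wv]
    if word_vectors:
        # np.mean divides by the number of found words, giving a true average
        final_embeddings.append(np.mean(word_vectors, axis=0))
    else:
        # Stand-in zero vector keeps indexes aligned with flatten_corpus
        final_embeddings.append(np.zeros(model.vector_size, dtype=np.float32))

assert len(final_embeddings) == len(flatten_corpus)  # 159571 == 159571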
Answered By: gojomo