Can gensim Doc2Vec be used to compare a novel document to a trained model?

Question:

I have a set of documents that all fit a pre-defined category, and I have successfully trained a model on those documents.

The question is, if I have a novel document, how can I calculate how closely this new document lines up with my trained model?

My current solution:

# Infer a vector for the new document's tokens (`steps` is named `epochs` in gensim 4.x)
novel_vector = model.infer_vector(novel_doc_words, steps=20)
# Find the most similar training documents (`model.docvecs` is `model.dv` in gensim 4.x)
similarity_scores = model.docvecs.most_similar([novel_vector])
# Average the cosine similarities of those top-N nearest training documents
overall_similarity = sum(sim for _, sim in similarity_scores) / len(similarity_scores)

I was unable to find any convenience methods for this in the documentation.

Asked By: mkalish


Answers:

There’s no built-in method to check this sort of “lines up with” value, with respect to the whole model.

A more typical approach, matching existing capabilities, would be to train a model on a diversity of documents – not just those in a specific category. Then, after inferring a new document’s vector, calculate its average distance to documents of just the category of interest.
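
For example, a minimal sketch of that approach (not exact code, and assuming a gensim 4.x model trained on a mixed corpus, plus a hypothetical list category_doc_tags naming the training documents in the category of interest) might look like:

import numpy as np

# Hypothetical: tags of the training documents that belong to the category of interest
category_doc_tags = ["cat_doc_0", "cat_doc_1", "cat_doc_2"]

# Infer a vector for the new, unseen document (use steps= instead of epochs= in gensim 3.x)
new_vector = model.infer_vector(novel_doc_words, epochs=20)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Average similarity of the new document to just the category's documents
category_similarity = np.mean([cosine(new_vector, model.dv[tag]) for tag in category_doc_tags])

A higher category_similarity for documents of the category than for outside documents would suggest the new document "lines up" with that category.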

If you instead train a model only on documents of a single, self-similar category, the learned coordinate space won't reflect the full range of possible documents outside that category as well.

That said, if your current code – which checks how similar a new document is to the top-N nearest neighbors – seems to give good results for your purposes, maybe it’s acceptable. I’d just expect better results from a model that had trained on a wider variety of documents.

Answered By: gojomo

Perhaps I’m not understanding fully, but in the Doc2Vec tutorial from Gensim, under "Assessing the Model" (where it checks for self-similarity), it’s explained that:

Basically, we’re pretending as if the training corpus is some new unseen data and then seeing how they compare with the trained model.

That explanation relates to this code:

ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    # Re-infer a vector for each training document as if it were unseen
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    # Rank every training document by similarity to the inferred vector
    sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
    # Where the document ranks against itself (0 means it is its own nearest neighbor)
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)

    second_ranks.append(sims[1])

Could you not just tokenize an incoming new document and plug it into model.infer_vector?

Would that not return a similarity score between the incoming document and the corpus used to train the model?
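
Something along these lines, perhaps (a minimal sketch, assuming a gensim 4.x model and an incoming document as a raw string; simple_preprocess here just stands in for whatever tokenization the training corpus used):

from gensim.utils import simple_preprocess

# Tokenize the incoming document the same way the training corpus was tokenized
new_doc = "Some incoming document text to compare against the training corpus."
new_tokens = simple_preprocess(new_doc)

# Infer a vector for the unseen document and rank training documents by similarity to it
inferred_vector = model.infer_vector(new_tokens)
sims = model.dv.most_similar([inferred_vector], topn=10)

# Each entry is a (training_doc_tag, cosine_similarity) pair
for tag, similarity in sims:
    print(tag, similarity)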

Answered By: server-snake