Ways of obtaining a similarity metric between two full text documents?

Question

Suppose I have three text documents, for example three randomly generated texts.

Document 1:

“Whole every miles as tiled at seven or. Wished he entire esteem mr oh by. Possible bed you pleasure civility boy elegance ham. He prevent request by if in pleased. Picture too and concern has was comfort. Ten difficult resembled eagerness nor. Same park bore on be….”

Document 2:

“Style too own civil out along. Perfectly offending attempted add arranging age gentleman concluded. Get who uncommonly our expression ten increasing considered occasional travelling. Ever read tell year give may men call its. Piqued son turned fat income played end wicket…”

If I want to obtain, in Python (using libraries), a metric of how similar these two documents are to a third one (in other words, which of the two documents is more similar to the third), what would be the best way to proceed?

Edit: I have seen other questions whose answers compare individual sentences to other sentences, but I am not interested in that. I want to compare a full text (consisting of related sentences) against another full text and obtain a number (which, for example, would be larger than the number obtained by comparing a less similar document against the same target).

Asked By: Andres C

Answer 1

There is no simple answer to this question, as different similarity measures will perform better or worse depending on the particular task you want to perform.

Having said that, you do have a couple of options for comparing blocks of text. This post compares and ranks several different ways of computing sentence similarity, which you can then aggregate to obtain full-document similarity. How to aggregate will also depend on your particular task. A simple but often well-performing approach is to compute the average of the sentence similarities between the 2 (or more) documents.
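As a rough illustration of aggregating sentence similarities into a document score, here is a minimal sketch using only the standard library. It makes several assumptions that are not part of the answer above: sentences are split naively on punctuation, each sentence is represented as a bag of words and compared with cosine similarity, and the aggregation matches every sentence with its best counterpart in the other document before averaging. The function names are hypothetical, and any other sentence similarity could be swapped in for `cosine`.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter on '.', '!' and '?'; a real tokenizer would handle abbreviations.
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def sentence_vector(sentence):
    # Bag-of-words representation: lowercase word -> count.
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def directed_average(vecs_x, vecs_y):
    # For each sentence in X, take its best match in Y, then average those scores.
    return sum(max(cosine(x, y) for y in vecs_y) for x in vecs_x) / len(vecs_x)


def document_similarity(doc_a, doc_b):
    # Assumed aggregation: average of best-match scores, made symmetric
    # by averaging the A->B and B->A directions.
    vecs_a = [sentence_vector(s) for s in split_sentences(doc_a)]
    vecs_b = [sentence_vector(s) for s in split_sentences(doc_b)]
    if not vecs_a or not vecs_b:
        return 0.0
    return (directed_average(vecs_a, vecs_b) + directed_average(vecs_b, vecs_a)) / 2


if __name__ == "__main__":
    target = "The cat sat on the mat. It purred."
    doc_1 = "A cat sat on a mat. It purred loudly."
    doc_2 = "Stock markets fell. Investors panicked."
    print(document_similarity(target, doc_1), document_similarity(target, doc_2))
```

To answer "which of the two documents is closer to the third", compute the score of each candidate against the target and compare the two numbers.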

Other useful links on this topic include:

Introduction to Information Retrieval (free book)
Doc2Vec (from gensim, for paragraph embeddings, which is probably very suitable for your case)
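For whole-document comparison without sentence splitting, a common baseline from information retrieval is to represent each document as a TF-IDF vector and compare vectors with cosine similarity. The sketch below is a standard-library illustration of that idea, not Doc2Vec and not code from either resource above. The function names and the smoothed IDF formula are choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (dict: term -> weight) per document.
    counts_per_doc = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for counts in counts_per_doc for term in counts)
    # Smoothed IDF (an assumption of this sketch) so that terms present in
    # every document keep a small nonzero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in counts.items()} for counts in counts_per_doc]


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def scores_against_target(target, candidates):
    # Returns one similarity score per candidate, in the same order as given.
    vectors = tfidf_vectors([target] + list(candidates))
    return [cosine(vectors[0], v) for v in vectors[1:]]


if __name__ == "__main__":
    target = "the cat sat on the mat"
    candidates = ["a cat sat on a mat", "stock markets fell sharply today"]
    print(scores_against_target(target, candidates))
```

With only a handful of documents the IDF weights carry little information, so on a small collection this behaves much like plain word-count cosine similarity.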

Answered By: Mateo Torres

Answer 2

You could try the Simphile NLP text similarity library (disclosure: I’m the author). It offers several language-agnostic methods: JaccardSimilarity, CompressionSimilarity, EuclidianSimilarity. Each has its advantages, and all work well for full-document comparison:

Install:

pip install simphile

This example shows Jaccard, but the usage is exactly the same with Euclidian or Compression:

from simphile import jaccard_similarity

text_a = "I love dogs"
text_b = "I love cats"

print(f"Jaccard Similarity: {jaccard_similarity(text_a, text_b)}")
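For readers who want to see what a Jaccard score measures, here is a minimal standard-library sketch of Jaccard similarity over word sets: the size of the intersection divided by the size of the union. It is an illustration only and is not claimed to reproduce Simphile's internals; the tokenization, the function names, and the handling of two empty texts are assumptions of this example.

```python
import re


def token_set(text):
    # Lowercased set of word tokens; duplicates are discarded by the set.
    return set(re.findall(r"\w+", text.lower()))


def jaccard(text_a, text_b):
    a, b = token_set(text_a), token_set(text_b)
    if not a and not b:
        # Convention chosen for this sketch: two empty texts count as identical.
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Shared words: {"i", "love"}; all words: {"i", "love", "dogs", "cats"}.
    print(jaccard("I love dogs", "I love cats"))  # 0.5
```

Because it works on sets, this score ignores word order and word frequency, which is worth keeping in mind when comparing long documents.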

Answered By: Brian Risk
