# How to compute the similarity between two text documents?

## Question:

I am looking at working on an NLP project, in any programming language (though Python will be my preference).

I want to take two documents and determine how similar they are.

The common way of doing this is to transform the documents into TF-IDF vectors and then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this. See esp. Introduction to Information Retrieval, which is free and available online.
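
For reference, one common textbook formulation of the two steps is sketched below. It is only one of several variants; libraries differ in their smoothing and normalization details, so do not expect it to reproduce any particular implementation digit for digit.

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)},
\qquad
\mathrm{sim}(d_1, d_2) = \cos\theta = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\lVert \mathbf{v}_1 \rVert \, \lVert \mathbf{v}_2 \rVert}
$$

Here $\mathrm{tf}(t, d)$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing $t$, and $\mathbf{v}_i$ is the tf-idf vector of document $d_i$. If the vectors are already normalized to unit length, the cosine reduces to a plain dot product.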

### Computing Pairwise Similarities

TF-IDF (and similar text transformations) are implemented in the Python packages Gensim and scikit-learn. In the latter package, computing cosine similarities is as easy as

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

or, if the documents are plain strings,

>>> corpus = ["I'd like an apple",
...           "An apple a day keeps the doctor away",
...           "Never compare an apple to an orange",
...           "I prefer scikit-learn to Orange",
...           "The scikit-learn docs are Orange and Blue"]
>>> vect = TfidfVectorizer(min_df=1, stop_words="english")
>>> tfidf = vect.fit_transform(corpus)
>>> pairwise_similarity = tfidf * tfidf.T

though Gensim may have more options for this kind of task.

[Disclaimer: I was involved in the scikit-learn TF-IDF implementation.]

### Interpreting the Results

From above, pairwise_similarity is a SciPy sparse matrix that is square in shape, with the number of rows and columns equal to the number of documents in the corpus.

>>> pairwise_similarity
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 17 stored elements in Compressed Sparse Row format>

You can convert the sparse array to a NumPy array via .toarray() or .A:

>>> pairwise_similarity.toarray()
array([[1.        , 0.17668795, 0.27056873, 0.        , 0.        ],
       [0.17668795, 1.        , 0.15439436, 0.        , 0.        ],
       [0.27056873, 0.15439436, 1.        , 0.19635649, 0.16815247],
       [0.        , 0.        , 0.19635649, 1.        , 0.54499756],
       [0.        , 0.        , 0.16815247, 0.54499756, 1.        ]])

Let’s say we want to find the document most similar to the final document, "The scikit-learn docs are Orange and Blue". This document has index 4 in corpus. You can find the index of the most similar document by taking the argmax of that row, but first you’ll need to mask the 1’s, which represent the similarity of each document to itself. You can do the latter through np.fill_diagonal(), and the former through np.nanargmax():

>>> import numpy as np

>>> arr = pairwise_similarity.toarray()
>>> np.fill_diagonal(arr, np.nan)

>>> input_doc = "The scikit-learn docs are Orange and Blue"
>>> input_idx = corpus.index(input_doc)
>>> input_idx
4

>>> result_idx = np.nanargmax(arr[input_idx])
>>> corpus[result_idx]
'I prefer scikit-learn to Orange'

Note: the purpose of using a sparse matrix is to save a substantial amount of space for a large corpus and vocabulary. Instead of converting to a NumPy array, you could do:

>>> n, _ = pairwise_similarity.shape
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()
3

Cosine similarity between two documents is generally used as the similarity measure. In Java, you can use Lucene (if your collection is fairly large) or LingPipe to do this. The basic concept is to count the terms in every document and calculate the dot product of the term vectors. The libraries provide several improvements over this general approach, e.g. using inverse document frequencies and calculating tf-idf vectors. If you are looking to do something more complex, LingPipe also provides methods to calculate LSA similarity between documents, which can give better results than cosine similarity on plain term vectors.
For Python, you can use NLTK.
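
The basic concept described above (count terms, then take a normalized dot product) can be sketched with the Python standard library alone. This is a minimal illustration, not what Lucene or LingPipe do internally; the whitespace tokenizer and the function names `term_counts` and `cosine_similarity` are assumptions made for the example.

```python
import math
from collections import Counter


def term_counts(text):
    """Lowercase, split on whitespace, and count each term."""
    return Counter(text.lower().split())


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two raw term-count vectors."""
    counts_a, counts_b = term_counts(text_a), term_counts(text_b)
    # Only terms present in both documents contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity("an apple a day", "an apple a day"))
print(cosine_similarity("an apple a day", "never compare an apple to an orange"))
```

Raw counts give frequent words like "an" a lot of weight, which is the main reason to move on to tf-idf weighting.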

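Here’s a little app to get you started:

import difflib as dl

# Example inputs (placeholders; substitute your own documents)
a = 'a little bird'
b = 'a little bird chirps'

sim = dl.get_close_matches

s = 0
wa = a.split()
wb = b.split()

# Count the words of a that have at least one close match among the words of b
for i in wa:
    if sim(i, wb):
        s += 1

n = float(s) / float(len(wa))
print('%d%% similarity' % int(n * 100))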
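
The snippet above counts fuzzy word matches in one direction only. As an alternative sketch using the same standard-library module, `difflib.SequenceMatcher` compares the two word sequences symmetrically and takes word order into account. The helper name `sequence_similarity` is made up for this example.

```python
from difflib import SequenceMatcher


def sequence_similarity(a, b):
    """Ratio in [0, 1] based on matching blocks between the two word lists."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


print(sequence_similarity("a little bird", "a little bird chirps"))
print(sequence_similarity("a little bird", "a big dog barks"))
```

The ratio is twice the number of matched words divided by the total number of words in both inputs, so it penalizes extra words on either side.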

You might want to try this online service for cosine document similarity: http://www.scurtu.it/documentSimilarity.html

import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "http://www.scurtu.it/apis/documentSimilarity"
inputDict = {}
inputDict['doc1'] = 'Document with some text'
inputDict['doc2'] = 'Other document with some text'
# POST data must be bytes in Python 3
params = urlencode(inputDict).encode('utf-8')
with urlopen(API_URL, params) as f:
    # Assumes the service answers with a JSON document
    responseObject = json.loads(f.read())
print(responseObject)

Identical to @larsman's answer, but with some preprocessing:

import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer

# word_tokenize needs NLTK's tokenizer data, e.g. nltk.download('punkt') once
stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    '''remove punctuation, lowercase, stem'''
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(text1, text2):
    tfidf = vectorizer.fit_transform([text1, text2])
    # Rows are L2-normalized, so the product holds the cosine similarities
    return (tfidf * tfidf.T).toarray()[0, 1]

print(cosine_sim('a little bird', 'a little bird'))
print(cosine_sim('a little bird', 'a little bird chirps'))
print(cosine_sim('a little bird', 'a big dog barks'))

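It’s an old question, but I found this can be done easily with spaCy. Once the document is read, the simple similarity API can be used to find the cosine similarity between the document vectors.

pip install spacy

You also need a language model. The model name used here, en_core_web_md, is an assumption; any installed English pipeline with word vectors should work:

python -m spacy download en_core_web_md

Then use it like so:

import spacy

# Assumption: en_core_web_md is installed (see above)
nlp = spacy.load('en_core_web_md')

doc1 = nlp(u'Hello hi there!')
doc2 = nlp(u'Hello hi there!')
doc3 = nlp(u'Hey whatsup?')

# The values in the comments are from the original answer; they depend on the model
print(doc1.similarity(doc2))  # 0.999999954642
print(doc2.similarity(doc3))  # 0.699032527716
print(doc1.similarity(doc3))  # 0.699032527716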

If you are more interested in measuring the semantic similarity of two pieces of text, I suggest taking a look at this GitLab project. You can run it as a server, and there is also a pre-built model that you can use to measure the similarity of two pieces of text. Even though it is mostly trained for measuring the similarity of two sentences, you can still use it in your case. It is written in Java, but you can run it as a RESTful service.

Another option is DKPro Similarity, a library with various algorithms for measuring the similarity of texts. However, it is also written in Java.

Code example:

// this similarity measure is defined in the dkpro.similarity.algorithms.lexical-asl package
// you need to add that to your .pom to make that example work
// there are some examples that should work out of the box in dkpro.similarity.example-gpl
TextSimilarityMeasure measure = new WordNGramJaccardMeasure(3);    // Use word trigrams

String[] tokens1 = "This is a short example text .".split(" ");
String[] tokens2 = "A short example text could look like that .".split(" ");

double score = measure.getSimilarity(tokens1, tokens2);

System.out.println("Similarity: " + score);
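
To show what a word n-gram Jaccard measure computes, here is a rough Python sketch of the idea on the same two token lists. It is not DKPro's implementation, and DKPro may differ in details such as case handling, so the score is not guaranteed to match the Java output. The function names are made up for this example.

```python
def word_ngrams(tokens, n):
    """Return the set of contiguous word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(tokens_a, tokens_b, n=3):
    """Jaccard coefficient: intersection size over union size of the n-gram sets."""
    grams_a, grams_b = word_ngrams(tokens_a, n), word_ngrams(tokens_b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0  # both texts are shorter than n tokens
    return len(grams_a & grams_b) / len(union)


tokens1 = "This is a short example text .".split(" ")
tokens2 = "A short example text could look like that .".split(" ")
print(ngram_jaccard(tokens1, tokens2))
```

In this sketch the comparison is case-sensitive, so "a short example" and "A short example" count as different trigrams; lowercasing the tokens first would raise the score.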

If you are looking for something more accurate, you need a better tool than tf-idf. The Universal Sentence Encoder is one of the more accurate options for finding the similarity between any two pieces of text. Google provides pretrained models that you can use in your own application without training anything from scratch. First, install tensorflow and tensorflow-hub:

pip install tensorflow
pip install tensorflow_hub

The code below lets you convert any text to a fixed length vector representation and then you can use the dot product to find out the similarity between them

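import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Note: hub.Module, tf.placeholder and tf.Session are TensorFlow 1.x APIs.
# Assumption: module_url points at a TF1-format Universal Sentence Encoder
# module; the version below is an example and is not taken from the original.
module_url = "https://tfhub.dev/google/universal-sentence-encoder/2"

# Import the Universal Sentence Encoder's TF Hub module
embed = hub.Module(module_url)

# sample text
messages = [
    # Smartphones
    "My phone is not good.",

    # Weather
    "Will it snow tomorrow?",
    "Recently a lot of hurricanes have hit the US",

    # Food and health
    "An apple a day, keeps the doctors away",
    "Eating strawberries is healthy",
]

similarity_input_placeholder = tf.placeholder(tf.string, shape=(None))
similarity_message_encodings = embed(similarity_input_placeholder)
with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    message_embeddings_ = session.run(similarity_message_encodings, feed_dict={similarity_input_placeholder: messages})

    corr = np.inner(message_embeddings_, message_embeddings_)
    print(corr)
    # heatmap() is defined in the plotting code below; define it before running this
    heatmap(messages, messages, corr)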

and the code for plotting:

import matplotlib.pyplot as plt
import numpy as np

def heatmap(x_labels, y_labels, values):
    fig, ax = plt.subplots()
    im = ax.imshow(values)

    # We want to show all ticks...
    ax.set_xticks(np.arange(len(x_labels)))
    ax.set_yticks(np.arange(len(y_labels)))
    # ... and label them with the respective list entries
    ax.set_xticklabels(x_labels)
    ax.set_yticklabels(y_labels)

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", fontsize=10,
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    for i in range(len(y_labels)):
        for j in range(len(x_labels)):
            text = ax.text(j, i, "%.2f" % values[i, j],
                           ha="center", va="center", color="w",
                           fontsize=6)

    fig.tight_layout()
    plt.show()

The result is a heatmap of the pairwise scores. Each text is most similar to itself, and next most similar to the texts that are close to it in meaning.

IMPORTANT: The first time you run the code it will be slow because it needs to download the model. To avoid downloading the model again and use the local copy instead, create a cache folder, set it in the TFHUB_CACHE_DIR environment variable, and after the first run point module_url at the cached path:

import os
import tensorflow_hub as hub

tf_hub_cache_dir = "universal_encoder_cached/"
os.environ["TFHUB_CACHE_DIR"] = tf_hub_cache_dir

# pointing to the folder inside cache dir, it will be unique on your system
module_url = tf_hub_cache_dir+"/d8fbeb5c580e50f975ef73e80bebba9654228449/"
embed = hub.Module(module_url)

For Syntactic Similarity

There are 3 easy ways of detecting similarity:

• Word2Vec
• GloVe
• TF-IDF or CountVectorizer

For Semantic Similarity

One can use BERT embeddings and try different word pooling strategies to get a document embedding, and then apply cosine similarity to the document embeddings.
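
To make "pooling strategies" concrete, here is a small sketch that turns per-token vectors into one document vector by mean pooling or max pooling and then compares documents with cosine similarity. The 3-dimensional vectors are invented stand-ins for real BERT token embeddings, and the function names are assumptions for the example; no BERT model is loaded.

```python
import math


def mean_pool(token_vectors):
    """Average each dimension over all token vectors."""
    n = len(token_vectors)
    return [sum(dim) / n for dim in zip(*token_vectors)]


def max_pool(token_vectors):
    """Take the maximum of each dimension over all token vectors."""
    return [max(dim) for dim in zip(*token_vectors)]


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "token embeddings" (made up); real BERT vectors have hundreds of dimensions.
doc_a = [[0.1, 0.9, 0.0], [0.3, 0.7, 0.2]]
doc_b = [[0.2, 0.8, 0.1], [0.0, 0.6, 0.4], [0.1, 0.9, 0.3]]

for pool in (mean_pool, max_pool):
    print(pool.__name__, cosine(pool(doc_a), pool(doc_b)))
```

Both strategies yield a fixed-length vector regardless of document length, which is what makes documents with different token counts comparable.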

A more advanced approach is to use BERTScore to measure similarity.

To find sentence similarity with very little data and still get high accuracy, you can use the Python package below, which uses pre-trained BERT models:

pip install similar-sentences

I am combining the solutions from the answers of @FredFoo and @Renaud. My solution applies @Renaud’s preprocessing to @FredFoo’s text corpus and then displays the pairwise similarities that are greater than 0. I ran this code on Windows after installing Python and pip. pip is installed as part of Python, but you may have to add it explicitly by re-running the installer, choosing Modify, and then selecting pip. I use the command line to execute my Python code, saved in a file "similarity.py". I had to execute the following commands (note that the package to install is scikit-learn; the sklearn name on PyPI is deprecated):

>set PYTHONPATH=%PYTHONPATH%;C:_location_of_python_lib_
>python -m pip install scikit-learn
>python -m pip install nltk
>py similarity.py

The code for similarity.py is as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
import nltk, string
import numpy as np

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

vect = TfidfVectorizer(tokenizer=normalize, stop_words='english')
tfidf = vect.fit_transform(corpus)

pairwise_similarity = tfidf * tfidf.T

#view the pairwise similarities
print(pairwise_similarity)

#check how a string is normalized
print(normalize("The scikit-learn docs are Orange and Blue"))

We can use Sentence Transformers (SBERT) for this task.

A simple example from the SBERT documentation:

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
# Two lists of sentences
sentences1 = ['The cat sits outside']
sentences2 = ['The dog plays in the garden']
# Compute embeddings for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)
# Compute cosine similarities
cosine_scores = util.cos_sim(embeddings1, embeddings2)
# Output the pairs with their score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i],
          sentences2[i], cosine_scores[i][i]))

Creator of the Simphile NLP text similarity Python package here. Simphile contains several text similarity methods that are language agnostic and less CPU-intensive than language embeddings.

Install:

pip install simphile

Choose your favorite method. This example shows three:

from simphile import jaccard_similarity, euclidian_similarity, compression_similarity

text_a = "I love dogs"
text_b = "I love cats"

print(f"Jaccard Similarity: {jaccard_similarity(text_a, text_b)}")
print(f"Euclidian Similarity: {euclidian_similarity(text_a, text_b)}")
print(f"Compression Similarity: {compression_similarity(text_a, text_b)}")

• Compression Similarity – leverages the pattern recognition of compression algorithms
• Euclidian Similarity – treats texts like points in multi-dimensional space and calculates their closeness
• Jaccard Similarity – texts are more similar the more their words overlap
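
The compression idea in the list above may be the least familiar, so here is a rough standard-library sketch of it based on the normalized compression distance (NCD). This is not Simphile's implementation; the use of zlib and the function names are assumptions made for illustration. The intuition: if two texts share patterns, compressing them together costs little more than compressing one of them alone.

```python
import zlib


def compressed_size(text):
    """Size in bytes of the zlib-compressed UTF-8 encoding of the text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def compression_similarity_sketch(text_a, text_b):
    """1 minus the normalized compression distance (NCD) of the two texts."""
    size_a, size_b = compressed_size(text_a), compressed_size(text_b)
    size_ab = compressed_size(text_a + text_b)
    ncd = (size_ab - min(size_a, size_b)) / max(size_a, size_b)
    return 1.0 - ncd


same = "the quick brown fox jumps over the lazy dog " * 20
other = "lorem ipsum dolor sit amet consectetur adipiscing elit " * 20

print(compression_similarity_sketch(same, same))
print(compression_similarity_sketch(same, other))
```

On very short strings the fixed overhead of the compressor dominates, so a measure like this is more meaningful on longer texts.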