Find the similarity metric between two strings
Question:
How do I get the probability of a string being similar to another string in Python?
I want to get a decimal value like 0.9 (meaning 90%) etc. Preferably with standard Python and library.
e.g.
similar("Apple","Appel") #would have a high prob.
similar("Apple","Mango") #would have a lower prob.
Answers:
You can create a function like:
def similar(w1, w2):
    # Pad the shorter word with spaces so both have the same length
    w1 = w1 + ' ' * (len(w2) - len(w1))
    w2 = w2 + ' ' * (len(w1) - len(w2))
    return sum(1 if i == j else 0 for i, j in zip(w1, w2)) / float(len(w1))
There is a built-in:
from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
Using it:
>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0
I think maybe you are looking for an algorithm describing the distance between strings. Here are some you may refer to:
TheFuzz is a package that implements Levenshtein distance in Python, with some helper functions for situations where you may want two distinct strings to be considered identical. For example:
>>> from thefuzz import fuzz
>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
100
Package distance includes Levenshtein distance:
import distance
distance.levenshtein("lenvestein", "levenshtein")
# 3
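Since the question asks for standard Python and a value between 0 and 1, here is a minimal sketch of Levenshtein distance without any third-party package. The function names `levenshtein` and `levenshtein_similarity` are made up for this example, and dividing by the longer length is just one common way to normalize, not the only one.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Map the distance to 0..1 (assumed normalization: longer length)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein_similarity("Apple", "Appel"))
print(levenshtein_similarity("Apple", "Mango"))
```

This runs in O(len(a) * len(b)) time, so for long inputs one of the compiled libraries mentioned here is probably the better choice.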
Solution #1: Python builtin
Use SequenceMatcher from difflib.
pros: native Python library, no extra package needed.
cons: too limited; there are many other good algorithms for string similarity out there.
example:
>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
Solution #2: jellyfish library
It's a very good library with good coverage and few issues.
It supports:
– Levenshtein Distance
– Damerau-Levenshtein Distance
– Jaro Distance
– Jaro-Winkler Distance
– Match Rating Approach Comparison
– Hamming Distance
pros: easy to use, wide range of supported algorithms, tested.
cons: not a native library.
example:
>>> import jellyfish
>>> jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')
2
>>> jellyfish.jaro_distance(u'jellyfish', u'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance(u'jellyfish', u'jellyfihs')
1
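To show what the Jaro measure computes, here is a rough pure-Python sketch. It follows the textbook definition (matching window, then half the number of out-of-order matches) and is not jellyfish's actual implementation; the name `jaro` is chosen only for this example.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity in the range 0..1."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk both match sequences in order and count positions that disagree.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


print(jaro("jellyfish", "smellyfish"))
```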
The builtin SequenceMatcher is very slow on large input; here's how it can be done with diff-match-patch:
from diff_match_patch import diff_match_patch
def compute_similarity_and_diff(text1, text2):
    dmp = diff_match_patch()
    dmp.Diff_Timeout = 0.0  # 0 means no time limit on the diff
    diff = dmp.diff_main(text1, text2, False)
    # similarity: characters in "equal" chunks (op == 0) over the longer length
    common_text = sum(len(txt) for op, txt in diff if op == 0)
    text_length = max(len(text1), len(text2))
    sim = common_text / text_length
    return sim, diff
Note that difflib.SequenceMatcher builds its score from contiguous matching blocks, and on long, repetitive input it can record far fewer blocks than expected, which is often not what is desired. For example:
>>> a1 = "Apple"
>>> a2 = "Appel"
>>> a1 *= 50
>>> a2 *= 50
>>> SequenceMatcher(None, a1, a2).ratio()
0.012 # very low
>>> SequenceMatcher(None, a1, a2).get_matching_blocks()
[Match(a=0, b=0, size=3), Match(a=250, b=250, size=0)] # only the first block is recorded
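One likely factor in the low score above is SequenceMatcher's `autojunk` heuristic, which the difflib documentation describes as treating very frequent items as junk once a sequence is at least 200 items long. The sketch below reruns the same comparison with `autojunk=False`; the variable names are only for illustration.

```python
from difflib import SequenceMatcher

a1 = "Apple" * 50
a2 = "Appel" * 50

# Default behaviour: autojunk=True, every letter here is "popular".
default_ratio = SequenceMatcher(None, a1, a2).ratio()

# Same comparison with the heuristic switched off.
tuned_ratio = SequenceMatcher(None, a1, a2, autojunk=False).ratio()

print(default_ratio, tuned_ratio)
```

Turning the heuristic off tends to make matching slower on large inputs, so it is a trade-off rather than a free fix.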
Finding the similarity between two strings is closely related to the concept of pairwise sequence alignment in bioinformatics. There are many dedicated libraries for this, including Biopython. This example uses the Needleman-Wunsch algorithm:
>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()
>>> aligner.score(a1, a2)
200.0
>>> aligner.algorithm
'Needleman-Wunsch'
Using biopython or another bioinformatics package is more flexible than any part of the python standard library since many different scoring schemes and algorithms are available. Also, you can actually get the matching sequences to visualise what is happening:
>>> alignment = next(aligner.align(a1, a2))
>>> alignment.score
200.0
>>> print(alignment)
AppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleAppleApple

AppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppelAppel
You can find most of the text similarity methods and how they are calculated under this link: https://github.com/luozhouyang/python-string-similarity#python-string-similarity
Here are some examples:
– Normalized, metric, similarity and distance
– (Normalized) similarity and distance
– Metric distances
– Shingles (n-gram) based similarity and distance
– Levenshtein
– Normalized Levenshtein
– Weighted Levenshtein
– Damerau-Levenshtein
– Optimal String Alignment
– Jaro-Winkler
– Longest Common Subsequence
– Metric Longest Common Subsequence
– N-Gram
– Shingle (n-gram) based algorithms
– Q-Gram
– Cosine similarity
– Jaccard index
– Sorensen-Dice coefficient
– Overlap coefficient (i.e., Szymkiewicz-Simpson)
There are many metrics to define similarity and distance between strings, as mentioned above. I will give my 5 cents by showing an example of Jaccard similarity with Q-grams and an example with edit distance.
The libraries
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
from nltk.metrics.distance import edit_distance
Jaccard Similarity
1 - jaccard_distance(set(ngrams('Apple', 2)), set(ngrams('Appel', 2)))
and we get:
0.33333333333333337
And for Apple and Mango:
1 - jaccard_distance(set(ngrams('Apple', 2)), set(ngrams('Mango', 2)))
and we get:
0.0
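For anyone who wants the same Jaccard calculation without NLTK, a small standard-library sketch could look like this. The helper names `qgrams` and `jaccard_similarity` are invented for the example, and it uses plain substrings rather than NLTK's tuples of characters.

```python
def qgrams(text, q=2):
    """Set of all overlapping substrings of length q."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard_similarity(a, b, q=2):
    """Shared q-grams divided by all distinct q-grams of both strings."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    union = grams_a | grams_b
    if not union:
        return 1.0  # assumption: inputs too short for any q-gram count as equal
    return len(grams_a & grams_b) / len(union)


print(jaccard_similarity("Apple", "Appel"))
print(jaccard_similarity("Apple", "Mango"))
```

For 'Apple' and 'Appel' the bigram sets share 'Ap' and 'pp' out of six distinct bigrams, which is where the value of about 0.33 comes from.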
Edit Distance
edit_distance('Apple', 'Appel')
and we get:
2
And finally,
edit_distance('Apple', 'Mango')
and we get:
5
Cosine Similarity on Q-Grams (q=2)
Another solution is to work with the textdistance library. I will provide an example of Cosine Similarity:
import textdistance
1 - textdistance.Cosine(qval=2).distance('Apple', 'Appel')
and we get:
0.5
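To make the 0.5 less of a black box, here is a sketch of cosine similarity over bigram counts using only the standard library. This is the textbook vector form, so it may not agree with textdistance on every input (for instance when q-grams repeat); the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def qgram_counts(text, q=2):
    """Count every overlapping substring of length q."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def cosine_similarity(a, b, q=2):
    """Dot product of the q-gram count vectors over the product of norms."""
    counts_a, counts_b = qgram_counts(a, q), qgram_counts(b, q)
    dot = sum(counts_a[g] * counts_b[g] for g in counts_a.keys() & counts_b.keys())
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: no q-grams means nothing to compare
    return dot / (norm_a * norm_b)


print(cosine_similarity("Apple", "Appel"))
```

Each word has four distinct bigrams and they share two, giving 2 / (2 * 2) = 0.5.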
Textdistance:
TextDistance is a Python library for comparing the distance between two or more sequences using many algorithms. It has:
– 30+ algorithms
– Pure Python implementation
– Simple usage
– Comparison of more than two sequences
– Some algorithms have more than one implementation in one class
– Optional numpy usage for maximum speed
Example1:
import textdistance
textdistance.hamming('test', 'text')
Output:
1
Example2:
import textdistance
textdistance.hamming.normalized_similarity('test', 'text')
Output:
0.75
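The two Hamming results above are easy to reproduce by hand. The sketch below is a standard-library approximation, assuming that extra characters in the longer string each count as one mismatch; the function names are illustrative only.

```python
from itertools import zip_longest


def hamming_distance(a, b):
    """Number of positions at which the two strings differ."""
    # zip_longest pads the shorter string with None, which never equals a character.
    return sum(x != y for x, y in zip_longest(a, b))


def hamming_similarity(a, b):
    """1 minus the distance divided by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1 - hamming_distance(a, b) / longest


print(hamming_distance("test", "text"))
print(hamming_similarity("test", "text"))
```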
Thanks and Cheers!!!
Here’s what I thought of:
import string
def match(a, b):
    a, b = a.lower(), b.lower()
    error = 0
    # Sum the per-letter count differences between the two strings
    for i in string.ascii_lowercase:
        error += abs(a.count(i) - b.count(i))
    total = len(a) + len(b)
    return (total - error) / total
if __name__ == "__main__":
    print(match("pple inc", "Apple Inc."))
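The same letter-count idea can be written with `collections.Counter`, which scans each string once instead of calling `count` for every letter. This is only a sketch of an equivalent formulation; `letter_match` is a made-up name, and the empty-input behaviour is an assumption.

```python
from collections import Counter
from string import ascii_lowercase


def letter_match(a, b):
    """Share of characters not accounted for by letter-count differences."""
    a, b = a.lower(), b.lower()
    count_a = Counter(ch for ch in a if ch in ascii_lowercase)
    count_b = Counter(ch for ch in b if ch in ascii_lowercase)
    # Counter subtraction keeps only positive counts, so the two one-sided
    # differences together equal the sum of absolute differences.
    error = sum((count_a - count_b).values()) + sum((count_b - count_a).values())
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # assumption: two empty strings count as identical
    return (total - error) / total


print(letter_match("pple inc", "Apple Inc."))
```

Like the loop version, this ignores character order entirely, so anagrams score 1.0.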
BLEU score
BLEU, or the Bilingual Evaluation Understudy, is a score for comparing a candidate translation of text to one or more reference translations. A perfect match results in a score of 1.0, whereas a perfect mismatch results in a score of 0.0. Although developed for translation, it can be used to evaluate text generated for a suite of natural language processing tasks.
Code:
import nltk
from nltk.translate import bleu
from nltk.translate.bleu_score import SmoothingFunction
smoothie = SmoothingFunction().method4
C1='Text'
C2='Best'
print('BLEUscore:',bleu([C1], C2, smoothing_function=smoothie))
Examples: By updating C1 and C2.
C1='Test' C2='Test'
BLEUscore: 1.0
C1='Test' C2='Best'
BLEUscore: 0.2326589746035907
C1='Test' C2='Text'
BLEUscore: 0.2866227639866161
You can also compare sentence similarity:
C1='It is tough.' C2='It is rough.'
BLEUscore: 0.7348889200874658
C1='It is tough.' C2='It is tough.'
BLEUscore: 1.0
Python 3.6+
No library imported
Works well in most scenarios
On Stack Overflow, when you try to add a tag or post a question, it brings up all the relevant items. This is so convenient and is exactly the algorithm that I was looking for. Therefore, I coded a query set similarity filter.
def compare(qs, ip):
    al = 2  # "allowed": score for an exact hit and size of the search range
    v = 0
    for ii, letter in enumerate(ip):
        if letter == qs[ii]:
            v += al
        else:
            ac = 0
            # Look for the letter at nearby offsets on either side
            for jj in range(al):
                if ii - jj < 0 or ii + jj > len(qs) - 1:
                    break
                elif letter == qs[ii - jj] or letter == qs[ii + jj]:
                    ac += jj
                    break
            v += ac
    return v
def getSimilarQuerySet(queryset, inp, length):
    # Score every candidate, sort by score (highest first), keep the top `length`
    scores = {it: compare(it, inp) for it in queryset}
    return [k for k, v in reversed(sorted(scores.items(), key=lambda item: item[1]))][:length]
if __name__ == "__main__":
    print(compare('apple', 'mongo'))
    # 0
    print(compare('apple', 'apple'))
    # 10
    print(compare('apple', 'appel'))
    # 7
    print(compare('dude', 'ud'))
    # 1
    print(compare('dude', 'du'))
    # 4
    print(compare('dude', 'dud'))
    # 6
    print(getSimilarQuerySet(
        [
            "java",
            "jquery",
            "javascript",
            "jude",
            "aja",
        ],
        "ja",
        2,
    ))
    # ['javascript', 'java']
Explanation
compare takes two strings and returns a non-negative integer. You can edit the al (allowed) variable in compare; it indicates how large a range to search through. It works like this: the two strings are iterated, and if the same character is found at the same index, the accumulator is increased by the largest value. Otherwise, we search within the index range of allowed and, if a match is found, add to the accumulator based on how far away the letter is (the further, the smaller). length indicates how many items you want as the result, that is, those most similar to the input string.
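For the same "suggest the closest entries" use case, the standard library also ships `difflib.get_close_matches`, shown here as an alternative sketch rather than a replacement for the function above. It ranks candidates by SequenceMatcher ratio, so its ordering can differ from what `compare` produces; the tag list is reused from the example.

```python
from difflib import get_close_matches

tags = ["java", "jquery", "javascript", "jude", "aja"]

# n caps the number of results; cutoff (0..1) drops weak candidates.
suggestions = get_close_matches("ja", tags, n=2, cutoff=0.6)
print(suggestions)
```

Because the ratio penalizes length differences, short inputs favour short candidates; lowering `cutoff` lets longer tags such as "javascript" through.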
Adding the spaCy NLP library to the mix as well:
import spacy
import jellyfish
from difflib import SequenceMatcher

@profile  # injected by kernprof when run under line_profiler
def main():
    str1 = "Mar 31 09:08:41 The world is beautiful"
    str2 = "Mar 31 19:08:42 Beautiful is the world"
    print("NLP Similarity=", nlp(str1).similarity(nlp(str2)))
    print("Diff lib similarity", SequenceMatcher(None, str1, str2).ratio())
    print("Jellyfish lib similarity", jellyfish.jaro_distance(str1, str2))
if __name__ == '__main__':
    # python3 -m spacy download en_core_web_sm
    # nlp = spacy.load("en_core_web_sm")
    nlp = spacy.load("en_core_web_md")
    main()
Run with Robert Kern’s line_profiler:
kernprof -l -v ./python/loganalysis/testspacy.py
NLP Similarity= 0.9999999821467294
Diff lib similarity 0.5897435897435898
Jellyfish lib similarity 0.8561253561253562
However, the timings are revealing:
Function: main at line 32
Line #   Hits      Time   Per Hit   % Time  Line Contents
==============================================================
    32                                      @profile
    33                                      def main():
    34      1       1.0       1.0      0.0      str1= "Mar 31 09:08:41 The world is beautiful"
    35      1       0.0       0.0      0.0      str2= "Mar 31 19:08:42 Beautiful is the world"
    36      1   43248.0   43248.0     99.1      print("NLP Similarity=",nlp(str1).similarity(nlp(str2)))
    37      1     375.0     375.0      0.9      print("Diff lib similarity",SequenceMatcher(None, str1, str2).ratio())
    38      1      30.0      30.0      0.1      print("Jellyfish lib similarity",jellyfish.jaro_distance(str1, str2))
I have my own for my purposes, which is 2x faster than difflib SequenceMatcher’s quick_ratio(), while providing similar results. a and b are strings:
score = 0
# For each character of a, add how many times it occurs in b
for letter in a:
    score += b.count(letter)