Good Python modules for fuzzy string comparison?
Question:
I’m looking for a Python module that can do simple fuzzy string comparisons. Specifically, I’d like a percentage of how similar the strings are. I know this is potentially subjective so I was hoping to find a library that can do positional comparisons as well as longest similar string matches, among other things.
Basically, I’m hoping to find something that is simple enough to yield a single percentage while still configurable enough that I can specify what type of comparison(s) to do.
Answers:
difflib can do it.
Example from the docs:
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
Check it out. It has other functions that can help you build something custom.
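get_close_matches also accepts n (maximum number of results) and cutoff (minimum similarity, 0 to 1) parameters, which give you some of the configurability the question asks for:

```python
from difflib import get_close_matches

# Keep only the single best candidate scoring at least 0.7.
print(get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'],
                        n=1, cutoff=0.7))  # ['apple']
```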
I like nosklo’s answer; another method is the Damerau-Levenshtein distance:
“In information theory and computer science, Damerau–Levenshtein distance is a ‘distance’ (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two characters.”
An implementation in Python from Wikibooks (note that this recursive version implements plain Levenshtein distance, i.e. without the transposition operation, and runs in exponential time, so it is only practical for short strings):

def lev(a, b):
    if not a: return len(b)
    if not b: return len(a)
    return min(lev(a[1:], b[1:]) + (a[0] != b[0]),
               lev(a[1:], b) + 1,
               lev(a, b[1:]) + 1)
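To match the Damerau–Levenshtein definition quoted above, the recursion needs one more case for adjacent transpositions. A sketch of that variant (the restricted form, also called optimal string alignment; again exponential-time, so short inputs only):

```python
def damerau_lev(a, b):
    # Plain Levenshtein cases: substitute, delete, insert.
    if not a:
        return len(b)
    if not b:
        return len(a)
    d = min(damerau_lev(a[1:], b[1:]) + (a[0] != b[0]),
            damerau_lev(a[1:], b) + 1,
            damerau_lev(a, b[1:]) + 1)
    # Extra case: a swap of two adjacent characters counts as one edit.
    if len(a) > 1 and len(b) > 1 and a[0] == b[1] and a[1] == b[0]:
        d = min(d, damerau_lev(a[2:], b[2:]) + 1)
    return d

print(damerau_lev('abcd', 'acbd'))  # 1 (one transposition)
```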
More from Wikibooks,
this gives you the length of the longest common substring (LCS):
def LCSubstr_len(S, T):
    m, n = len(S), len(T)
    L = [[0] * (n + 1) for i in range(m + 1)]
    lcs = 0
    for i in range(m):
        for j in range(n):
            if S[i] == T[j]:
                L[i + 1][j + 1] = L[i][j] + 1
                lcs = max(lcs, L[i + 1][j + 1])
    return lcs
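The same dynamic-programming table can also yield the substring itself rather than just its length. A sketch (lcsubstr is a hypothetical helper name):

```python
def lcsubstr(s, t):
    # Track the end position of the best run so the substring can be sliced out.
    m, n = len(s), len(t)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    best, end = 0, 0
    for i in range(m):
        for j in range(n):
            if s[i] == t[j]:
                table[i + 1][j + 1] = table[i][j] + 1
                if table[i + 1][j + 1] > best:
                    best = table[i + 1][j + 1]
                    end = i + 1
    return s[end - best:end]

print(lcsubstr('xyzabcd', 'abcdxyz'))  # 'abcd'
```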
There is also Google’s own google-diff-match-patch (“Currently available in Java, JavaScript, C++ and Python”).
(I can’t comment on it, since I have only used Python’s difflib myself.)
Levenshtein Python extension and C library.
https://github.com/ztane/python-Levenshtein/
The Levenshtein Python C extension module contains functions for fast
computation of
– Levenshtein (edit) distance, and edit operations
– string similarity
– approximate median strings, and generally string averaging
– string sequence and set similarity
It supports both normal and Unicode strings.
$ pip install python-levenshtein
...
$ python
>>> import Levenshtein
>>> help(Levenshtein.ratio)
ratio(...)
Compute similarity of two strings.
ratio(string1, string2)
The similarity is a number between 0 and 1; it's usually equal to or
somewhat higher than difflib.SequenceMatcher.ratio(), because it's
based on real minimal edit distance.
Examples:
>>> ratio('Hello world!', 'Holly grail!')
0.58333333333333337
>>> ratio('Brian', 'Jesus')
0.0
>>> help(Levenshtein.distance)
distance(...)
Compute absolute Levenshtein distance of two strings.
distance(string1, string2)
Examples (it's hard to spell Levenshtein correctly):
>>> distance('Levenshtein', 'Lenvinsten')
4
>>> distance('Levenshtein', 'Levensthein')
2
>>> distance('Levenshtein', 'Levenshten')
1
>>> distance('Levenshtein', 'Levenshtein')
0
Here’s a Python function for computing the longest common substring of two words (it may need tweaking to work for multi-word phrases):

def lcs(word1, word2):
    # Build the set of all substrings of each word, then intersect the sets.
    w1 = set(word1[i:j] for i in range(len(word1))
             for j in range(i + 1, len(word1) + 1))
    w2 = set(word2[i:j] for i in range(len(word2))
             for j in range(i + 1, len(word2) + 1))
    common_subs = w1.intersection(w2)
    # Pick the longest common substring (empty inputs yield no candidates).
    return max(common_subs, key=len) if common_subs else ''
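For multi-word phrases, the same set-intersection idea can be applied to runs of words instead of runs of characters. A sketch, splitting on whitespace (lcs_words is a hypothetical helper name):

```python
def lcs_words(phrase1, phrase2):
    # Compare contiguous runs of words rather than characters.
    w1, w2 = phrase1.split(), phrase2.split()
    subs1 = {tuple(w1[i:j]) for i in range(len(w1))
             for j in range(i + 1, len(w1) + 1)}
    subs2 = {tuple(w2[i:j]) for i in range(len(w2))
             for j in range(i + 1, len(w2) + 1)}
    common = subs1 & subs2
    if not common:
        return ''
    return ' '.join(max(common, key=len))

print(lcs_words('the quick brown fox', 'a quick brown dog'))  # 'quick brown'
```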
As nosklo said, use the difflib module from the Python standard library.
The difflib module can return a measure of the sequences’ similarity using the ratio()
method of a SequenceMatcher() object. The similarity is returned as a float in the range 0.0 to 1.0.
>>> import difflib
>>> difflib.SequenceMatcher(None, 'abcde', 'abcde').ratio()
1.0
>>> difflib.SequenceMatcher(None, 'abcde', 'zbcde').ratio()
0.80000000000000004
>>> difflib.SequenceMatcher(None, 'abcde', 'zyzzy').ratio()
0.0
Another alternative would be to use the recently released package FuzzyWuzzy. The various functions supported by the package are also described in this blogpost.
Here’s how it can be done using Charikar’s simhash. This approach is also suitable for long documents, and it will detect 100% similarity even when you change the order of words in the documents:
http://blog.simpliplant.eu/calculating-similarity-between-text-strings-in-python/
Jellyfish is a Python module which supports many string comparison metrics, including phonetic matching. Pure Python implementations of Levenshtein edit distance are quite slow compared to Jellyfish’s implementation.
Example Usage:
>>> import jellyfish
>>> jellyfish.levenshtein_distance('jellyfish', 'smellyfish')
2
>>> jellyfish.jaro_distance('jellyfish', 'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance('jellyfish', 'jellyfihs')
1
>>> jellyfish.metaphone('Jellyfish')
'JLFX'
>>> jellyfish.soundex('Jellyfish')
'J412'
>>> jellyfish.nysiis('Jellyfish')
'JALYF'
>>> jellyfish.match_rating_codex('Jellyfish')
'JLLFSH'
I am using double metaphone, which works like a charm.
An example:
>>> dm(u'aubrey')
('APR', '')
>>> dm(u'richard')
('RXRT', 'RKRT')
>>> dm(u'katherine') == dm(u'catherine')
True
Update:
Jellyfish also has it, under phonetic encoding.
Take a look at the Fuzzy module. It has fast, C-based implementations of soundex, NYSIIS, and double metaphone.
A good introduction can be found at: http://www.informit.com/articles/article.aspx?p=1848528
I’ve been using FuzzyWuzzy from SeatGeek with great success.
https://github.com/seatgeek/fuzzywuzzy
Specifically the token set ratio function…
They also did a great write up on the process of fuzzy string matching:
http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python