String similarity metrics in Python

Question:

I want to find the string similarity between two strings. en.wikipedia has examples of some of the metrics, and code.google has a Python implementation of Levenshtein distance.
Is there a better algorithm, (and hopefully a Python library), under these constraints:

  1. I want to do fuzzy matches between strings. e.g. matches('Hello, All you people', 'hello, all You peopl') should return True.
  2. False negatives are acceptable; false positives, except in extremely rare cases, are not.
  3. This is done in a non-realtime setting, so speed is not (much of) a concern.
  4. [Edit] I am comparing multi-word strings.

Would something other than Levenshtein distance (or Levenshtein ratio) be a better algorithm for my case?

Asked By: agiliq


Answers:

Is that what you mean?

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']

Look at http://docs.python.org/library/difflib.html#difflib.get_close_matches
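
get_close_matches also accepts n (maximum number of matches to return) and cutoff (minimum similarity, between 0 and 1) arguments, so a sketch of the asker's example might look like the following; the 0.9 cutoff here is an assumption to tune, not a recommended value:

>>> from difflib import get_close_matches
>>> a = 'Hello, All you people'
>>> b = 'hello, all You peopl'
>>> # Treat it as a match if b is within cutoff=0.9 of a after case-folding.
>>> bool(get_close_matches(a.lower(), [b.lower()], n=1, cutoff=0.9))
True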

Answered By: Tzury Bar Yochay

There’s a great resource for string similarity metrics at the University of Sheffield. It lists various metrics (beyond just Levenshtein) and has open-source implementations of them. Many of them look easy to adapt to Python (see the sketch after the list below).

http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

Here’s a bit of the list:

  • Hamming distance
  • Levenshtein distance
  • Needleman-Wunsch distance or Sellers algorithm
  • and many more…
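
For reference, here is a minimal pure-Python Levenshtein distance using the standard dynamic-programming recurrence (a sketch for illustration, not code from the Sheffield page):

def levenshtein(s, t):
    # dp[j] holds the edit distance between the prefix of s seen so far
    # and the first j characters of t.
    dp = list(range(len(t) + 1))
    for i, sc in enumerate(s, 1):
        prev, dp[0] = dp[0], i
        for j, tc in enumerate(t, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,          # deletion
                dp[j - 1] + 1,      # insertion
                prev + (sc != tc),  # substitution (free when chars match)
            )
    return dp[-1]

>>> levenshtein('ape', 'apple')
2
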
Answered By: ariddell

I realize it’s not the same thing, but this is close enough:

>>> import difflib
>>> a = 'Hello, All you people'
>>> b = 'hello, all You peopl'
>>> seq = difflib.SequenceMatcher(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.97560975609756095

You can wrap this in a function:

def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

>>> similar(a, b)
True
>>> similar('Hello, world', 'Hi, world')
False
Answered By: Nadia Alramli

I would use Levenshtein distance, or the so-called Damerau distance (which takes transpositions into account), rather than the difflib stuff, for two reasons: (1) “fast enough” (dynamic-programming algorithm) and “whoooosh” (bit-bashing) C code is available, and (2) well-understood behaviour, e.g. Levenshtein satisfies the triangle inequality and thus can be used in e.g. a Burkhard-Keller tree.
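
To make the Burkhard-Keller point concrete, here is a minimal BK-tree sketch keyed on Levenshtein distance (an illustration assuming the python-Levenshtein package, not code from this answer):

import Levenshtein

class BKTree:
    # Burkhard-Keller tree: each child hangs off its distance to the parent.
    def __init__(self, word):
        self.word = word
        self.children = {}  # distance -> subtree

    def add(self, word):
        d = Levenshtein.distance(word, self.word)
        if d in self.children:
            self.children[d].add(word)
        else:
            self.children[d] = BKTree(word)

    def search(self, word, max_dist):
        # The triangle inequality bounds which subtrees can hold matches,
        # so whole branches are pruned without computing their distances.
        d = Levenshtein.distance(word, self.word)
        results = [self.word] if d <= max_dist else []
        for child_d, child in self.children.items():
            if d - max_dist <= child_d <= d + max_dist:
                results.extend(child.search(word, max_dist))
        return results

>>> tree = BKTree('hello')
>>> for w in ['help', 'loop', 'world']:
...     tree.add(w)
>>> tree.search('hell', max_dist=1)
['hello', 'help']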

Threshold: you should treat as “positive” only those cases where distance < (1 - X) * max(len(string1), len(string2)), and adjust X (the similarity factor) to suit yourself. One way of choosing X is to get a sample of matches, calculate X for each, ignore cases where X < (say) 0.8 or 0.9, then sort the remainder in descending order of X, eyeball them, record the correct result, and calculate some cost-of-mistakes measure for various levels of X.

N.B. Your ape/apple example has distance 2, so X is 0.6 … I would only use a threshold as low as 0.75 if I were desperately looking for something and had a high false-negative penalty.
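
A sketch of that decision rule, assuming the python-Levenshtein package (the default x=0.8 is just a placeholder to tune):

import Levenshtein

def is_match(s1, s2, x=0.8):
    # Positive iff distance < (1 - X) * max(len(s1), len(s2)).
    return Levenshtein.distance(s1, s2) < (1 - x) * max(len(s1), len(s2))

>>> is_match('Hello, All you people'.lower(), 'hello, all You peopl'.lower())
True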

Answered By: John Machin

I know this isn’t the same, but you can adjust the ratio to filter out strings that are not similar enough and return the closest match to the string you are looking for.

Perhaps you would be more interested in semantic similarity metrics.

https://www.google.com/search?client=ubuntu&channel=fs&q=semantic+similarity+string+match&ie=utf-8&oe=utf-8

I realize you said speed is not an issue, but if you are processing a lot of strings, the approach below is very helpful.

import Levenshtein

def spellcheck(sentence, wordlist):
    # Replace each word with the closest wordlist entry by Levenshtein ratio.
    return ' '.join(max(wordlist, key=lambda x: Levenshtein.ratio(x, word))
                    for word in sentence.split())
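
For example, with a hypothetical four-word wordlist:

>>> wordlist = ['hello', 'all', 'you', 'people']
>>> spellcheck('helo al yu peple', wordlist)
'hello all you people'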

It’s about 20 times faster than difflib.

https://pypi.python.org/pypi/python-Levenshtein/

Answered By: John

This snippet calculates the difflib, Levenshtein, Sørensen, and Jaccard similarity values for two strings. In the snippet below, I iterate over a tsv in which the strings of interest occupy columns [3] and [4]. (pip install python-Levenshtein and pip install distance):

import codecs, difflib, Levenshtein, distance

with codecs.open("titles.tsv", "r", "utf-8") as f:
    title_list = f.read().split("\n")[:-1]

    for row in title_list:

        sr      = row.lower().split("\t")

        diffl   = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
        lev     = Levenshtein.ratio(sr[3], sr[4])
        sor     = 1 - distance.sorensen(sr[3], sr[4])
        jac     = 1 - distance.jaccard(sr[3], sr[4])

        print(diffl, lev, sor, jac)
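
One caveat worth knowing (an observation, not from the snippet above): the distance package's sorensen and jaccard functions treat a string as a set of characters, so anagrams score as identical, which matters given the asker's false-positive constraint:

>>> import distance
>>> 1 - distance.jaccard('listen', 'silent')  # anagrams share a character set
1.0
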
Answered By: duhaime

To avoid false positives, the method nratio() from the library ngramratio may help.

pip install ngramratio

>>> from ngramratio import ngramratio
>>> SequenceMatcherExtended = ngramratio.SequenceMatcherExtended

>>> a = 'Hi there'
>>> b = 'Hit here'

>>> seq = SequenceMatcherExtended(a=a.lower(), b=b.lower())

>>> seq.ratio()
0.875
>>> seq.nratio(1)  # this replicates `seq.ratio`.
0.875

>>> seq.nratio(2)
0.75

>>> seq.nratio(3)
0.5

nratio(n) only counts matching substrings of length >= n.

You can pick a value for n, say n = 2, and create a boolean similarity function as Nadia did in a previous reply.

def similar(seq1, seq2):
    return SequenceMatcherExtended(a=seq1.lower(), b=seq2.lower()).nratio(2) > 0.8

>>> similar(a, b)
False
>>> similar('Hi there', 'Hi ther')
True
Answered By: giacomo