Meaning behind 'thefuzz' / 'rapidfuzz' similarity metric when comparing strings

Question:

When using thefuzz in Python to calculate a simple ratio between two strings, a result of 0 means they are totally different, while a result of 100 represents a 100% match. What do intermediate results mean? Does a result of 82, say, mean that the two strings are 82% similar? Or is it just an abstract idea of 'bigger is better'?

The documentation is sadly lacking in any detail to answer this question, so far as I can tell.

Asked By: David Shaw


Answers:

There are a bunch of string matching algorithms that have been developed over the last… hundred years or so. I believe the string matching algorithm under the hood of this library is InDel.

InDel is a variation of the much more common Levenshtein distance algorithm. Levenshtein distance essentially counts the number of insertions, deletions, and substitutions needed to get from the first string to the second string.

With InDel, only insertions and deletions are counted. The ratio is calculated by dividing the number of insertions and deletions by the combined length of both strings, and then subtracting the result from 1. So the closer to 1, the closer the match, as it took fewer insertions and deletions to get from one string to the other.
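To make the arithmetic concrete, here is a minimal pure-Python sketch of that calculation. It is not the library's actual code: the function names (`lcs_length`, `indel_distance`, `indel_ratio`) are made up for illustration, and it relies on the standard identity that the number of insertions and deletions equals the combined length minus twice the length of the longest common subsequence. The result is scaled to 0-100 to match the scale described in the question.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a: str, b: str) -> int:
    """Insertions plus deletions needed to turn a into b (no substitutions)."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


def indel_ratio(a: str, b: str) -> float:
    """Similarity on a 0-100 scale: 100 * (1 - distance / combined length)."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0  # assumption: two empty strings count as identical
    return 100.0 * (1 - indel_distance(a, b) / total)


if __name__ == "__main__":
    print(indel_distance("kitten", "sitting"))          # 5
    print(round(indel_ratio("kitten", "sitting"), 1))   # 61.5
```

For "kitten" and "sitting" the combined length is 13 and 5 insertions or deletions are needed, so the score is 100 * (1 - 5/13), roughly 61.5.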

The real question you have to determine for yourself is how far from 1 (a perfect match) you are willing to go and still treat two strings as the same. Whatever threshold you choose, you will likely end up with some false positives and false negatives.

Answered By: JNevill

It is based on the Levenshtein distance normalized to [0, 1]; the similarity is 1 minus this normalized distance:

normalized = levenshtein / (length_word1 + length_word2) # substitution weight of 2.
normalized = levenshtein / max(length_word1, length_word2) # substitution weight of 1.

Here the ratio is returned as a percentage, so it lies in [0, 100].

Note that rapidfuzz uses the first interpretation, with weight 2, so a substitution counts as a deletion plus an insertion. That is the connection to InDel, which does not allow direct substitutions.
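The two normalizations can be compared with a small pure-Python sketch. This is an illustration only, not rapidfuzz's implementation, and the helper names (`weighted_levenshtein`, `similarity_cost2`, `similarity_cost1`) are hypothetical. It computes the Levenshtein distance with a configurable substitution cost and converts it to a 0-100 similarity using each denominator.

```python
def weighted_levenshtein(a: str, b: str, sub_cost: int = 1) -> int:
    """Edit distance with insert/delete cost 1 and a configurable substitution cost."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else sub_cost
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def similarity_cost2(a: str, b: str) -> float:
    """Substitution weight 2, normalized by the combined length."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0  # assumption: two empty strings count as identical
    return 100.0 * (1 - weighted_levenshtein(a, b, sub_cost=2) / total)


def similarity_cost1(a: str, b: str) -> float:
    """Substitution weight 1, normalized by the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # assumption: two empty strings count as identical
    return 100.0 * (1 - weighted_levenshtein(a, b, sub_cost=1) / longest)


if __name__ == "__main__":
    print(round(similarity_cost2("kitten", "sitting"), 1))  # 61.5
    print(round(similarity_cost1("kitten", "sitting"), 1))  # 57.1
```

With substitution cost 2, "kitten" to "sitting" costs 5 (two substitutions at 2 each plus one insertion), the same count as using insertions and deletions only. With cost 1 the distance is 3 and the score differs, so the two conventions are not interchangeable.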


So yes, you could say they are 82% similar with respect to the Levenshtein distance with substitution cost 2.

Answered By: Daraan