# Levenshtein distance substring

## Question:

Is there a good way to use Levenshtein distance to match one particular string to any region within a second, longer string?

Example:

``````
str1='aaaaa'
str2='bbbbbbaabaabbbb'

if str1 in str2 with a distance < 2:
    return True
``````

So in the above example, part of string 2 is `aabaa`, and the distance between `str1` and that part is less than 2 (`distance(str1, 'aabaa') < 2`), so the statement should return `True`.

The only way I can think of to do this is to take 5 characters from `str2` at a time, compare them with `str1`, and then repeat this while moving through `str2`. Unfortunately this seems really inefficient, and I need to process a large amount of data this way.

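## Answers:

The usual trick is to adjust the insertion cost (into the shorter string) or the deletion cost (from the longer string), so that the characters of the longer string that fall outside the matched region are not penalized. You may also want to consider the Damerau-Levenshtein distance, which additionally counts a transposition of two adjacent characters as a single edit.
https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance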
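
One way to read this hint is the standard approximate-substring variant of the edit-distance table (often attributed to Sellers): skipping text before the match is free, so the first row is all zeros, and skipping text after the match is free, so the answer is the minimum of the last row. The sketch below is a minimal pure-Python illustration under that reading; the function name `min_substring_distance` is made up for this example and does not come from any library.

```python
def min_substring_distance(pattern, text):
    """Smallest Levenshtein distance between pattern and any substring of text."""
    # prev[j] is the best distance between the pattern prefix handled so far
    # and some substring of text that ends at position j.
    # An empty pattern prefix matches anywhere at cost 0 (free leading skip).
    prev = [0] * (len(text) + 1)
    for i, pc in enumerate(pattern, start=1):
        # Against an empty substring, i pattern characters cost i edits.
        curr = [i]
        for j, tc in enumerate(text, start=1):
            curr.append(min(
                prev[j] + 1,               # pattern character left unmatched
                curr[j - 1] + 1,           # extra text character inside the match
                prev[j - 1] + (pc != tc),  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    # The match may end anywhere in text (free trailing skip).
    return min(prev)


str1 = 'aaaaa'
str2 = 'bbbbbbaabaabbbb'
print(min_substring_distance(str1, str2) < 2)  # True
```

This fills one table of size `len(pattern) * len(text)` instead of running a separate distance computation per window, and it also considers substrings whose length differs from that of the pattern.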

The trick is to generate all the substrings of appropriate length of `b`, then compare each one.

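``````
def lev_dist(a, b):
    # Note: this counts position-by-position mismatches plus a length
    # penalty. It is not a full Levenshtein distance, because it cannot
    # realign the strings after an insertion or deletion.
    length_cost = abs(len(a) - len(b))
    diff_cost = sum(1 for (aa, bb) in zip(a, b) if aa != bb)
    return diff_cost + length_cost

def all_substr_of_length(n, s):
    # Every window of n consecutive characters of s (s itself if it is shorter).
    if n > len(s):
        return [s]
    else:
        return [s[i:i+n] for i in range(0, len(s)-n+1)]

def lev_substr(a, b):
    """Gives minimum lev distance of all substrings of b and
    the single string a.
    """
    return min(lev_dist(a, bb) for bb in all_substr_of_length(len(a), b))

str1 = 'aaaaa'
str2 = 'bbbbbbaabaabbbb'

if lev_substr(str1, str2) < 2:
    print("it works!")
``````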
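
Note that `lev_dist` above compares characters position by position, so it agrees with the Levenshtein distance only when no insertion or deletion is involved. As a hedged illustration of the difference, the sketch below puts a textbook dynamic-programming Levenshtein function next to the same positional count; the names `levenshtein` and `positional_mismatches` are chosen for this example only.

```python
def levenshtein(a, b):
    """Textbook edit distance with insertions, deletions and substitutions."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def positional_mismatches(a, b):
    """Same computation as lev_dist in the answer above."""
    return abs(len(a) - len(b)) + sum(x != y for x, y in zip(a, b))


# A one-character shift: two edits for Levenshtein, five positional mismatches.
print(levenshtein("abcde", "bcdef"))            # 2
print(positional_mismatches("abcde", "bcdef"))  # 5
```

For the strings in the question both functions give 1 on the window `aabaa`, so the positional count is enough there. Passing `levenshtein` to the `min(...)` in `lev_substr` instead would handle shifted matches, at a higher cost per window.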

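You might have a look at the third-party `regex` module, which supports fuzzy matching. Here `{s<2}` allows fewer than two substitutions; use `e` instead of `s` to also allow insertions and deletions: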

``````
>>> import regex
>>> regex.search("(aaaaa){s<2}", 'bbbbbbaabaabbbb')
<regex.Match object; span=(6, 11), match='aabaa', fuzzy_counts=(1, 0, 0)>
``````

Since you are looking at strings of equal length, you can also use the Hamming distance, which is likely far faster than the Levenshtein distance on the same two strings:

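``````
str1 = 'aaaaa'
str2 = 'bbbbbbaabaabbbb'
# Compare str1 against every window of len(str1) characters in str2.
for s in [str2[i:i+len(str1)] for i in range(0, len(str2)-len(str1)+1)]:
    # Hamming distance: number of positions where the characters differ.
    if sum(a != b for a, b in zip(str1, s)) < 2:
        print(s)    # prints aabaa
``````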

I encountered this problem before, and I have not found a solution that avoids at least one `for` loop. I have implemented a solution that returns the number of matches within a given tolerance. It calls the existing Levenshtein implementation in polyleven, which can speed up the calculation.

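``````
# Assumption: poly_lev refers to polyleven's Levenshtein function.
from polyleven import levenshtein as poly_lev

def count_matches(seq, frag, sim_thresh=0.9):
    cont = 0
    n = len(frag)
    L = len(seq)
    assert L >= n
    # Slide a window of length n over seq; the +1 includes the last window.
    for m in range(L - n + 1):
        sim = 1 - poly_lev(frag, seq[m:m+n]) / n
        if sim >= sim_thresh:
            cont = cont + 1
    return cont
``````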

The function calculates a similarity value (between 0 and 1) between a string fragment and every same-length substring of a longer sequence, where the similarity is `1-levenshtein(str1,str2)/len(str1)`. Normalizing by the length of the fragment lets it give meaningful results for fragments of arbitrary length.
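
As a small worked example of this normalization, the sketch below computes the similarity for a single window. It does not use polyleven: `edit_distance` is a pure-Python stand-in written for this illustration, and `similarity` is likewise a made-up name.

```python
from functools import lru_cache


def edit_distance(a, b):
    """Stand-in (assumption) for a library Levenshtein function."""
    @lru_cache(maxsize=None)
    def d(i, j):
        # Distance between the prefixes a[:i] and b[:j].
        if i == 0 or j == 0:
            return i + j
        return min(
            d(i - 1, j) + 1,                           # deletion
            d(i, j - 1) + 1,                           # insertion
            d(i - 1, j - 1) + (a[i - 1] != b[j - 1]),  # match or substitution
        )
    return d(len(a), len(b))


def similarity(frag, window):
    # Normalize the distance by the fragment length.
    return 1 - edit_distance(frag, window) / len(frag)


print(similarity("aaaaa", "aabaa"))  # one edit in five characters, about 0.8
print(similarity("aaaaa", "aaaaa"))  # identical strings give 1.0
```

With the default `sim_thresh=0.9`, a five-character fragment therefore only matches exactly. Allowing one edit, as in the question, would need a threshold of 0.8 or lower.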
