Checking whether a fuzzy/approximate substring exists in a longer string, in Python?

Question:

Using algorithms like Levenshtein distance (via the Levenshtein module or difflib), it is easy to find approximate matches, e.g.:

>>> import difflib
>>> difflib.SequenceMatcher(None,"amazing","amaging").ratio()
0.8571428571428571

Fuzzy matches can then be detected by choosing a threshold as needed.
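
For example, with an arbitrary 0.8 cutoff, the pair above clears the threshold:

>>> difflib.SequenceMatcher(None, "amazing", "amaging").ratio() > 0.8
True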

Current requirement: to find a fuzzy substring, based on a threshold, in a bigger string.

E.g.:

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
#result = "manhatan","manhattin" and their indexes in large_string

One brute-force solution is to generate all substrings of length N-1 to N+1 (or other matching lengths), where N is the length of query_string, run Levenshtein on each of them, and check against the threshold (a sketch of this is shown below).
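
A minimal sketch of that brute force, for illustration only (my own sketch, not part of the original question; window lengths N-1 to N+1 and the 0.8 cutoff are arbitrary choices, and every overlapping window that clears the threshold is reported):

import difflib

def brute_force_fuzzy_substrings(large_string, query_string, threshold=0.8):
    # Slide windows of length N-1 .. N+1 over large_string and score each
    # window against query_string; overlapping hits are not de-duplicated.
    n = len(query_string)
    for length in range(n - 1, n + 2):
        for start in range(len(large_string) - length + 1):
            window = large_string[start:start + length]
            ratio = difflib.SequenceMatcher(None, window, query_string).ratio()
            if ratio >= threshold:
                yield window, start, ratio

large_string = "thelargemanhatanproject is a great project in themanhattincity"
for window, start, ratio in brute_force_fuzzy_substrings(large_string, "manhattan"):
    print((window, start, round(ratio, 3)))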

Is there a better solution available in Python, preferably a module included in Python 2.7 or an externally available module?

——————— UPDATE AND SOLUTION ———————

The Python regex module works pretty well, though it is a little slower than the built-in re module for fuzzy substring cases, which is an expected outcome given the extra work involved.
The output is as desired, and the degree of fuzziness can be controlled easily.

>>> import regex
>>> input = "Monalisa was painted by Leonrdo da Vinchi"
>>> regex.search(r'\b(leonardo){e<3}\s+(da)\s+(vinci){e<2}\b', input, flags=regex.IGNORECASE)
<regex.Match object; span=(23, 41), match=' Leonrdo da Vinchi', fuzzy_counts=(0, 2, 1)>
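
Applied to the example from this question, the same idea might look like the sketch below (my own addition, not part of the original update; the e<=2 error bound and the BESTMATCH flag are arbitrary choices here):

import regex

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"

# {e<=2} allows up to two errors (insertions, deletions or substitutions);
# BESTMATCH prefers the candidate with the fewest errors at each position.
for m in regex.finditer(r'(%s){e<=2}' % query_string, large_string, flags=regex.BESTMATCH):
    print((m.group(), m.span(), m.fuzzy_counts))
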
Asked By: DhruvPathak


Answers:

How about using difflib.SequenceMatcher.get_matching_blocks?

>>> import difflib
>>> large_string = "thelargemanhatanproject"
>>> query_string = "manhattan"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))
0.8888888888888888

>>> query_string = "banana"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))
0.6666666666666666

UPDATE

import difflib

def matches(large_string, query_string, threshold):
    words = large_string.split()
    for word in words:
        s = difflib.SequenceMatcher(None, word, query_string)
        match = ''.join(word[i:i+n] for i, j, n in s.get_matching_blocks() if n)
        if len(match) / float(len(query_string)) >= threshold:
            yield match

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
print list(matches(large_string, query_string, 0.8))

The above code prints: ['manhatan', 'manhattn']
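
The question also asks for the indexes in large_string; a small variant of the same generator (my addition, not part of the original answer) could track where each word starts and report the offset of its first matching block:

import difflib

def matches_with_index(large_string, query_string, threshold):
    # Same idea as matches() above, but also report where each
    # reconstructed match begins inside large_string.
    pos = 0
    for word in large_string.split():
        word_start = large_string.find(word, pos)
        pos = word_start + len(word)
        s = difflib.SequenceMatcher(None, word, query_string)
        blocks = [(i, n) for i, j, n in s.get_matching_blocks() if n]
        match = ''.join(word[i:i + n] for i, n in blocks)
        if blocks and len(match) / float(len(query_string)) >= threshold:
            yield match, word_start + blocks[0][0]

large_string = "thelargemanhatanproject is a great project in themanhattincity"
print(list(matches_with_index(large_string, "manhattan", 0.8)))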

Answered By: falsetru

Recently I’ve written an alignment library for Python: https://github.com/eseraygun/python-alignment

Using it, you can perform both global and local alignments with arbitrary scoring strategies on any pair of sequences. In your case, you actually need a semi-local alignment, since you don't care about substrings of query_string. I've simulated a semi-local alignment using local alignment plus some heuristics in the following code, but it is easy to extend the library with a proper implementation.

Here is the example code from the README file, modified for your case.

from alignment.sequence import Sequence, GAP_ELEMENT
from alignment.vocabulary import Vocabulary
from alignment.sequencealigner import SimpleScoring, LocalSequenceAligner

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"

# Create sequences to be aligned.
a = Sequence(large_string)
b = Sequence(query_string)

# Create a vocabulary and encode the sequences.
v = Vocabulary()
aEncoded = v.encodeSequence(a)
bEncoded = v.encodeSequence(b)

# Create a scoring and align the sequences using local aligner.
scoring = SimpleScoring(1, -1)
aligner = LocalSequenceAligner(scoring, -1, minScore=5)
score, encodeds = aligner.align(aEncoded, bEncoded, backtrace=True)

# Iterate over optimal alignments and print them.
for encoded in encodeds:
    alignment = v.decodeSequenceAlignment(encoded)

    # Simulate a semi-local alignment.
    if len(filter(lambda e: e != GAP_ELEMENT, alignment.second)) != len(b):
        continue
    if alignment.first[0] == GAP_ELEMENT or alignment.first[-1] == GAP_ELEMENT:
        continue
    if alignment.second[0] == GAP_ELEMENT or alignment.second[-1] == GAP_ELEMENT:
        continue

    print alignment
    print 'Alignment score:', alignment.score
    print 'Percent identity:', alignment.percentIdentity()
    print

The output for minScore=5 is as follows.

m a n h a - t a n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

m a n h a t t - i
m a n h a t t a n
Alignment score: 5
Percent identity: 77.7777777778

m a n h a t t i n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

If you remove the minScore argument, you will get only the best scoring matches.

m a n h a - t a n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

m a n h a t t i n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

Note that all algorithms in the library have O(n * m) time complexity, n and m being the lengths of the sequences.

Answered By: Eser Aygün

The new regex library that is expected to eventually replace re includes fuzzy matching.

https://pypi.python.org/pypi/regex/

The fuzzy matching syntax looks fairly expressive, but this would give you a match with one or fewer insertions/substitutions/deletions:

import regex
regex.match('(amazing){e<=1}', 'amaging')
Answered By: mgbelisle

I use fuzzywuzzy to fuzzy-match based on a threshold and fuzzysearch to fuzzy-extract words from the match.

process.extractBests takes a query, a list of words, and a cutoff score, and returns a list of (match, score) tuples above the cutoff.

find_near_matches takes the result of process.extractBests and returns the start and end indices of the words. I use the indices to build the matched words and use each built word to find its index in the large string. The max_l_dist argument of find_near_matches is the maximum Levenshtein distance, which has to be adjusted to suit your needs.

from fuzzysearch import find_near_matches
from fuzzywuzzy import process

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"

def fuzzy_extract(qs, ls, threshold):
    '''fuzzy matches 'qs' in 'ls' and returns list of 
    tuples of (word,index)
    '''
    for word, _ in process.extractBests(qs, (ls,), score_cutoff=threshold):
        print('word {}'.format(word))
        for match in find_near_matches(qs, word, max_l_dist=1):
            match = word[match.start:match.end]
            print('match {}'.format(match))
            index = ls.find(match)
            yield (match, index)

To test:

query_string = "manhattan"
print('query: {}nstring: {}'.format(query_string, large_string))
for match,index in fuzzy_extract(query_string, large_string, 70):
    print('match: {}nindex: {}'.format(match, index))

query_string = "citi"
print('query: {}nstring: {}'.format(query_string, large_string))
for match,index in fuzzy_extract(query_string, large_string, 30):
    print('match: {}nindex: {}'.format(match, index))

query_string = "greet"
print('query: {}nstring: {}'.format(query_string, large_string))
for match,index in fuzzy_extract(query_string, large_string, 30):
    print('match: {}nindex: {}'.format(match, index))

Output:

query: manhattan  
string: thelargemanhatanproject is a great project in themanhattincity  
match: manhatan  
index: 8  
match: manhattin  
index: 49  

query: citi  
string: thelargemanhatanproject is a great project in themanhattincity  
match: city  
index: 58  

query: greet  
string: thelargemanhatanproject is a great project in themanhattincity  
match: great  
index: 29 
Answered By: Nizam Mohamed

The approaches above are good, but I needed to find a small needle in lots of hay, and ended up approaching it like this:

from difflib import SequenceMatcher as SM
from nltk.util import ngrams
import codecs

needle = "this is the string we want to find"
hay    = "text text lots of text and more and more this string is the one we wanted to find and here is some more and even more still"

needle_length  = len(needle.split())
max_sim_val    = 0
max_sim_string = u""

for ngram in ngrams(hay.split(), needle_length + int(.2*needle_length)):
    hay_ngram = u" ".join(ngram)
    similarity = SM(None, hay_ngram, needle).ratio() 
    if similarity > max_sim_val:
        max_sim_val = similarity
        max_sim_string = hay_ngram

print max_sim_val, max_sim_string

Yields:

0.72972972973 this string is the one we wanted to find
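
If the character offset of that window is also needed, it can be recovered afterwards (my addition, continuing from the snippet above; since the haystack is single-spaced, the joined n-gram appears verbatim in it):

# Continuing from the snippet above: locate the best-scoring window
# in the original haystack (first occurrence only).
best_start = hay.find(max_sim_string)
print((best_start, best_start + len(max_sim_string)))
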
Answered By: duhaime

I ran into this problem and found that neither of the top two answers worked for me. Instead, I used the following approach to find the fuzzy match with the fewest errors:

from typing import Optional

import regex


def fuzzy_substring_search(major: str, minor: str, errs: int = 4) -> Optional[regex.Match]:
    """Find the closest matching fuzzy substring.

    Args:
        major: the string to search in
        minor: the string to search with
        errs: the maximum total number of errors allowed

    Returns:
        Optional[regex.Match] object
    """
    errs_ = 0
    s = regex.search(f"({minor}){{e<={errs_}}}", major)
    # Escalate the allowed error count until a match is found or the limit is hit.
    while s is None and errs_ <= errs:
        errs_ += 1
        s = regex.search(f"({minor}){{e<={errs_}}}", major)
    return s

This has the benefit of returning exact matches if they exist, and escalating fuzziness as needed.
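
For example (my own usage sketch, reusing the strings from the question), the first call that succeeds is the one with the smallest error budget:

large_string = "thelargemanhatanproject is a great project in themanhattincity"
m = fuzzy_substring_search(large_string, "manhattan")
if m is not None:
    print((m.group(), m.span(), m.fuzzy_counts))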

Answered By: Chris