Checking whether a fuzzy/approximate substring exists in a longer string, in Python?
Question:
Using algorithms such as Levenshtein distance (e.g. the python-Levenshtein or difflib modules), it is easy to find approximate matches, e.g.:
>>> import difflib
>>> difflib.SequenceMatcher(None,"amazing","amaging").ratio()
0.8571428571428571
Fuzzy matches can be detected by choosing a threshold as needed.
Current requirement: find a fuzzy substring, based on a threshold, in a bigger string, e.g.:
large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
#result = "manhatan","manhattin" and their indexes in large_string
One brute-force solution is to generate all substrings of length N-1 to N+1 (or some other matching length), where N is the length of query_string, run Levenshtein on them one by one, and check the threshold; a rough sketch of this is shown below.
Is there a better solution available in Python, preferably a module included in Python 2.7, or an externally available module?
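For reference, a rough sketch of that brute-force idea (my illustration only, scoring the windows with difflib.SequenceMatcher rather than a Levenshtein library):

import difflib

def brute_force_fuzzy_substrings(large_string, query_string, threshold=0.8):
    """Slide windows of length N-1..N+1 over large_string and score each one."""
    n = len(query_string)
    for size in range(n - 1, n + 2):
        for start in range(len(large_string) - size + 1):
            candidate = large_string[start:start + size]
            ratio = difflib.SequenceMatcher(None, candidate, query_string).ratio()
            if ratio >= threshold:
                yield candidate, start, ratio

large_string = "thelargemanhatanproject is a great project in themanhattincity"
for candidate, start, ratio in brute_force_fuzzy_substrings(large_string, "manhattan"):
    print(candidate, start, round(ratio, 2))
# prints several overlapping candidate windows around index 8 and index 49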
——————— UPDATE AND SOLUTION ———————
The Python regex module works pretty well, though it is a little slower than the built-in re module for fuzzy substring cases, which is an expected outcome given the extra operations.
The output is as desired, and the degree of fuzziness can be controlled easily.
>>> import regex
>>> input = "Monalisa was painted by Leonrdo da Vinchi"
>>> regex.search(r'\b(leonardo){e<3}\s+(da)\s+(vinci){e<2}\b', input, flags=regex.IGNORECASE)
<regex.Match object; span=(23, 41), match=' Leonrdo da Vinchi', fuzzy_counts=(0, 2, 1)>
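Applied to the example strings from the question (a rough sketch, not part of the original post; exact spans depend on the flags used), fuzzy matching can also recover the substrings and their indexes:

import regex

large_string = "thelargemanhatanproject is a great project in themanhattincity"

# {e<=2} allows up to two errors (insertions, deletions or substitutions);
# regex.BESTMATCH asks the engine to prefer the match with the fewest errors.
for m in regex.finditer(r'(manhattan){e<=2}', large_string, flags=regex.BESTMATCH):
    print(m.group(), m.start(), m.fuzzy_counts)
# should report fuzzy hits near 'manhatan' (index 8) and 'manhattin' (index 49)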
Answers:
How about using difflib.SequenceMatcher.get_matching_blocks?
>>> import difflib
>>> large_string = "thelargemanhatanproject"
>>> query_string = "manhattan"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))
0.8888888888888888
>>> query_string = "banana"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))
0.6666666666666666
UPDATE
import difflib

def matches(large_string, query_string, threshold):
    words = large_string.split()
    for word in words:
        s = difflib.SequenceMatcher(None, word, query_string)
        match = ''.join(word[i:i+n] for i, j, n in s.get_matching_blocks() if n)
        if len(match) / float(len(query_string)) >= threshold:
            yield match

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
print list(matches(large_string, query_string, 0.8))
The above code prints: ['manhatan', 'manhattn']
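The answer above returns only the matched text; since the question also asks for indexes, here is one possible extension (my addition, a sketch rather than part of the answer) that tracks word offsets with re.finditer:

import difflib
import re

def matches_with_index(large_string, query_string, threshold):
    for m in re.finditer(r'\S+', large_string):
        word = m.group()
        s = difflib.SequenceMatcher(None, word, query_string)
        blocks = [b for b in s.get_matching_blocks() if b[2]]
        match = ''.join(word[i:i + n] for i, j, n in blocks)
        if len(match) / float(len(query_string)) >= threshold:
            # offset of the first matching block, relative to large_string
            yield match, m.start() + blocks[0][0]

large_string = "thelargemanhatanproject is a great project in themanhattincity"
print(list(matches_with_index(large_string, "manhattan", 0.8)))
# should print something like: [('manhatan', 8), ('manhattn', 49)]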
Recently I’ve written an alignment library for Python: https://github.com/eseraygun/python-alignment
Using it, you can perform both global and local alignments with arbitrary scoring strategies on any pair of sequences. Actually, in your case you need semi-local alignments, since you don't care about substrings of query_string. I've simulated a semi-local alignment using local alignment and some heuristics in the following code, but it would be easy to extend the library with a proper implementation.
Here is the example code in the README file modified for your case.
from alignment.sequence import Sequence, GAP_ELEMENT
from alignment.vocabulary import Vocabulary
from alignment.sequencealigner import SimpleScoring, LocalSequenceAligner

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"

# Create sequences to be aligned.
a = Sequence(large_string)
b = Sequence(query_string)

# Create a vocabulary and encode the sequences.
v = Vocabulary()
aEncoded = v.encodeSequence(a)
bEncoded = v.encodeSequence(b)

# Create a scoring and align the sequences using the local aligner.
scoring = SimpleScoring(1, -1)
aligner = LocalSequenceAligner(scoring, -1, minScore=5)
score, encodeds = aligner.align(aEncoded, bEncoded, backtrace=True)

# Iterate over optimal alignments and print them.
for encoded in encodeds:
    alignment = v.decodeSequenceAlignment(encoded)

    # Simulate a semi-local alignment.
    if len(filter(lambda e: e != GAP_ELEMENT, alignment.second)) != len(b):
        continue
    if alignment.first[0] == GAP_ELEMENT or alignment.first[-1] == GAP_ELEMENT:
        continue
    if alignment.second[0] == GAP_ELEMENT or alignment.second[-1] == GAP_ELEMENT:
        continue

    print alignment
    print 'Alignment score:', alignment.score
    print 'Percent identity:', alignment.percentIdentity()
    print
The output for minScore=5 is as follows.
m a n h a - t a n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889
m a n h a t t - i
m a n h a t t a n
Alignment score: 5
Percent identity: 77.7777777778
m a n h a t t i n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889
If you remove the minScore argument, you will get only the best-scoring matches.
m a n h a - t a n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889
m a n h a t t i n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889
Note that all algorithms in the library have O(n * m) time complexity, n and m being the lengths of the sequences.
The new regex library, which is supposed to eventually replace re, includes fuzzy matching.
https://pypi.python.org/pypi/regex/
The fuzzy-matching syntax looks fairly expressive; this pattern would give you a match with one or fewer errors (insertions, deletions, or substitutions):
import regex
regex.match('(amazing){e<=1}', 'amaging')
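As a quick illustration of what the returned match object exposes (my addition, not part of the answer):

import regex

m = regex.match('(amazing){e<=1}', 'amaging')
if m is not None:
    # fuzzy_counts reports (substitutions, insertions, deletions)
    print(m.span(), m.fuzzy_counts)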
I use fuzzywuzzy to fuzzy-match based on a threshold, and fuzzysearch to extract the matching words from the result.
process.extractBests takes a query, a list of words, and a cutoff score, and returns a list of (match, score) tuples above the cutoff score.
find_near_matches takes the result of process.extractBests and returns the start and end indices of the words. I use those indices to build the matched words, and use each built word to find its index in the large string. The max_l_dist parameter of find_near_matches is the maximum Levenshtein distance and has to be adjusted to suit your needs.
from fuzzysearch import find_near_matches
from fuzzywuzzy import process

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"

def fuzzy_extract(qs, ls, threshold):
    '''fuzzy matches 'qs' in 'ls' and returns list of
    tuples of (word, index)
    '''
    for word, _ in process.extractBests(qs, (ls,), score_cutoff=threshold):
        print('word {}'.format(word))
        for match in find_near_matches(qs, word, max_l_dist=1):
            match = word[match.start:match.end]
            print('match {}'.format(match))
            index = ls.find(match)
            yield (match, index)
To test:
query_string = "manhattan"
print('query: {}nstring: {}'.format(query_string, large_string))
for match,index in fuzzy_extract(query_string, large_string, 70):
print('match: {}nindex: {}'.format(match, index))
query_string = "citi"
print('query: {}nstring: {}'.format(query_string, large_string))
for match,index in fuzzy_extract(query_string, large_string, 30):
print('match: {}nindex: {}'.format(match, index))
query_string = "greet"
print('query: {}nstring: {}'.format(query_string, large_string))
for match,index in fuzzy_extract(query_string, large_string, 30):
print('match: {}nindex: {}'.format(match, index))
Output:
query: manhattan
string: thelargemanhatanproject is a great project in themanhattincity
match: manhatan
index: 8
match: manhattin
index: 49
query: citi
string: thelargemanhatanproject is a great project in themanhattincity
match: city
index: 58
query: greet
string: thelargemanhatanproject is a great project in themanhattincity
match: great
index: 29
The approaches above are good, but I needed to find a small needle in lots of hay, and ended up approaching it like this:
from difflib import SequenceMatcher as SM
from nltk.util import ngrams
import codecs

needle = "this is the string we want to find"
hay = "text text lots of text and more and more this string is the one we wanted to find and here is some more and even more still"

needle_length = len(needle.split())
max_sim_val = 0
max_sim_string = u""

for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
    hay_ngram = u" ".join(ngram)
    similarity = SM(None, hay_ngram, needle).ratio()
    if similarity > max_sim_val:
        max_sim_val = similarity
        max_sim_string = hay_ngram

print max_sim_val, max_sim_string
Yields:
0.72972972973 this string is the one we wanted to find
I ran into this problem, and I found that neither of the top two answers worked. Instead, I used the following algorithm to detect the minimally wrong fuzzy match:
from typing import Optional

import regex

def fuzzy_substring_search(major: str, minor: str, errs: int = 4) -> Optional[regex.Match]:
    """Find the closest matching fuzzy substring.

    Args:
        major: the string to search in
        minor: the string to search with
        errs: the maximum total number of errors

    Returns:
        Optional[regex.Match] object
    """
    errs_ = 0
    s = regex.search(f"({minor}){{e<={errs_}}}", major)
    while s is None and errs_ < errs:
        errs_ += 1
        s = regex.search(f"({minor}){{e<={errs_}}}", major)
    return s
This has the benefit of returning exact matches if they exist, and escalating fuzziness as needed.
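A short usage sketch on the question's example (my addition; it assumes the function above is defined at module level):

large_string = "thelargemanhatanproject is a great project in themanhattincity"

m = fuzzy_substring_search(large_string, "manhattan")
if m is not None:
    # With one allowed error this should land on the 'manhatan' spelling near index 8.
    print(m.group(), m.span())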