Python regex prefers longer fuzzy match to shorter exact match

Question:

I am using regex in Python to search for multiple patterns in a string. A simplified example would be as follows:

import regex
s = "vrhvydhvkzejjvksdlstringvhehvehvurejlcslvdk"  #string to look into
p = ['(?P<string>string)', '(?P<longtext>longtext)']  #patterns to search for
r = regex.compile('(?b)(' + " | ".join(p) + '){s<=3}')  #regex, allowing for 3 mismatches, bestmatch to be reported
r.search(s)   #searching for patterns p in string s
<regex.Match object; span=(18, 25), match='stringv', fuzzy_counts=(1, 0, 0)>   #search results

My expected result would be:

<regex.Match object; span=(18, 24), match='string', fuzzy_counts=(0, 0, 0)>

Why does regex report the fuzzy match stringv with 1 mismatch instead of the exact match string? And how do I need to modify my code to get my expected result?

I am using Python 3.7.3 and regex 2.5.115.

Asked By: Agathe


Answers:

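The '(?b)(' + " | ".join(p) + '){s<=3}' expression results in the regex (?b)((?P<string>string) | (?P<longtext>longtext)){s<=3}; note the spaces around |. Without the verbose flag, those spaces are literal characters, so the first alternative is really "string " (seven characters, with a trailing space). Because {s<=3} only allows substitutions, the match must also be seven characters long: v is substituted for the space after (?P<string>string), and you get stringv with one substitution.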
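
To make the effect of the spaces visible, here is a minimal sketch that uses only the standard re module (no fuzzy matching involved). The variable names spaced and tight are my own, chosen for illustration. It only shows that the spaces produced by " | ".join(p) become part of the alternatives:

```python
import re

p = ['(?P<string>string)', '(?P<longtext>longtext)']

# Joined with " | ": the spaces end up inside the alternatives.
spaced = '(' + " | ".join(p) + ')'
# Joined with "|": the alternatives are exactly the given patterns.
tight = '(' + "|".join(p) + ')'

print(spaced)  # ((?P<string>string) | (?P<longtext>longtext))
print(tight)   # ((?P<string>string)|(?P<longtext>longtext))

# The spaced pattern needs a trailing space after "string" ...
spaced_exact = re.fullmatch(spaced, "string")
spaced_with_space = re.fullmatch(spaced, "string ")
# ... while the tight pattern matches the bare word.
tight_exact = re.fullmatch(tight, "string")

print(spaced_exact is None)           # True
print(spaced_with_space is not None)  # True
print(tight_exact is not None)        # True
```

With the fuzzy {s<=3} quantifier added, that required trailing space is what gets substituted by v.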

You need to join the alternatives without the spaces:

r = regex.compile('(?b)(' + "|".join(p) + '){s<=3}')  #regex, allowing for 3 mismatches, bestmatch to be reported

See the Python demo:

import regex
s = "vrhvydhvkzejjvksdlstringvhehvehvurejlcslvdk"  #string to look into
p = ['(?P<string>string)', '(?P<longtext>longtext)']  #patterns to search for
rx = '(?b)(' + "|".join(p) + '){s<=3}'  #joined with "|" only, so no literal spaces end up in the pattern
r = regex.compile(rx)  #regex, allowing for 3 mismatches, bestmatch to be reported
print( r.search(s) )
# => <regex.Match object; span=(18, 24), match='string'>
Answered By: Wiktor Stribiżew