Regex match (w+) to capture single words delimited by |||
Question:
I am trying to match if there’s singe word followed by s|||s
and then another single word followed by s|||s
so I’m using this regex:
single_word_regex = r'(w+)+s|||s(w+)s|||s.*'
And when I tried to match this string, the regex matching hangs or take minutes (possibly going into some sort of “deep loop”)
>>> import re
>>> import time
>>> single_word_regex = r'(w+)+s|||s(w+)s|||s.*'
>>> x = u'amornratchatchawansawangwong ||| amornratchatchawansawangwong . ||| 0.594819 0.5 0.594819 0.25 ||| 0-0 0-1 ||| 1 1 1 ||| |||'
>>> z = u'amor 我 ||| amor . i ||| 0.594819 0.0585231 0.594819 0.0489472 ||| 0-0 0-1 1-2 ||| 2 2 2 ||| |||'
>>> y = u'amor ||| amor ||| 0.396546 0.0833347 0.29741 0.08 ||| 0-0 0-1 ||| 3 4 2 ||| |||'
>>> re.match(single_word_regex, z, re.U)
>>> re.match(single_word_regex, y, re.U)
<_sre.SRE_Match object at 0x105b879c0>
>>> start = time.time(); re.match(single_word_regex, y, re.U); print time.time() - start
9.60826873779e-05
>>> start = time.time(); re.match(single_word_regex, x, re.U); print time.time() - start # It hangs...
Why is it taking that long?
Is there a better/simpler regex to capture this conditionlen(x.split(' ||| ')[0].split()) == 1 == len(x.split(' ||| ').split())
?
Answers:
Note that by itself, a r'(w+)+'
pattern will not cause the catastrophic backtracking, it will only be “evil” inside a longer expression and especially when it is placed next to the start of the pattern since in case subsequent subpatterns fail the engine backtracks to this one, and as the 1+ quantifier inside is again quantified with +
, that creates a huge amount of possible variations to try before failing. You may have a look at your regex demo and click the regex debugger on the left to see example regex engine behavior.
The current regex can be written as
r'^(w+)s|{3}s(w+)s|{3}s(.*)'
See the regex demo where there will be a match if you remove space and .
in the second field.
Details:
^
– start of string (not necessary with re.match
)
(w+)
– (Group 1) 1+ letters/digits/underscores
s
– a whitespace
|{3}
– 3 pipe symbols
s(w+)s|{3}s
– see above ((w+)
creates Group 2)
(.*)
– (Group 3) any 0+ characters other than linebreak characters.
I am trying to match if there’s singe word followed by s|||s
and then another single word followed by s|||s
so I’m using this regex:
single_word_regex = r'(w+)+s|||s(w+)s|||s.*'
And when I tried to match this string, the regex matching hangs or take minutes (possibly going into some sort of “deep loop”)
>>> import re
>>> import time
>>> single_word_regex = r'(w+)+s|||s(w+)s|||s.*'
>>> x = u'amornratchatchawansawangwong ||| amornratchatchawansawangwong . ||| 0.594819 0.5 0.594819 0.25 ||| 0-0 0-1 ||| 1 1 1 ||| |||'
>>> z = u'amor 我 ||| amor . i ||| 0.594819 0.0585231 0.594819 0.0489472 ||| 0-0 0-1 1-2 ||| 2 2 2 ||| |||'
>>> y = u'amor ||| amor ||| 0.396546 0.0833347 0.29741 0.08 ||| 0-0 0-1 ||| 3 4 2 ||| |||'
>>> re.match(single_word_regex, z, re.U)
>>> re.match(single_word_regex, y, re.U)
<_sre.SRE_Match object at 0x105b879c0>
>>> start = time.time(); re.match(single_word_regex, y, re.U); print time.time() - start
9.60826873779e-05
>>> start = time.time(); re.match(single_word_regex, x, re.U); print time.time() - start # It hangs...
Why is it taking that long?
Is there a better/simpler regex to capture this conditionlen(x.split(' ||| ')[0].split()) == 1 == len(x.split(' ||| ').split())
?
Note that by itself, a r'(w+)+'
pattern will not cause the catastrophic backtracking, it will only be “evil” inside a longer expression and especially when it is placed next to the start of the pattern since in case subsequent subpatterns fail the engine backtracks to this one, and as the 1+ quantifier inside is again quantified with +
, that creates a huge amount of possible variations to try before failing. You may have a look at your regex demo and click the regex debugger on the left to see example regex engine behavior.
The current regex can be written as
r'^(w+)s|{3}s(w+)s|{3}s(.*)'
See the regex demo where there will be a match if you remove space and .
in the second field.
Details:
^
– start of string (not necessary withre.match
)(w+)
– (Group 1) 1+ letters/digits/underscoress
– a whitespace|{3}
– 3 pipe symbolss(w+)s|{3}s
– see above ((w+)
creates Group 2)(.*)
– (Group 3) any 0+ characters other than linebreak characters.