Regex match (w+) to capture single words delimited by |||

Question:

I am trying to match if there’s singe word followed by s|||s and then another single word followed by s|||s so I’m using this regex:

single_word_regex = r'(w+)+s|||s(w+)s|||s.*'

And when I tried to match this string, the regex matching hangs or take minutes (possibly going into some sort of “deep loop”)

>>> import re
>>> import time
>>> single_word_regex = r'(w+)+s|||s(w+)s|||s.*'        
>>> x = u'amornratchatchawansawangwong ||| amornratchatchawansawangwong . ||| 0.594819 0.5 0.594819 0.25 ||| 0-0 0-1 ||| 1 1 1 ||| |||'
>>> z = u'amor 我 ||| amor . i ||| 0.594819 0.0585231 0.594819 0.0489472 ||| 0-0 0-1 1-2 ||| 2 2 2 ||| |||'
>>> y = u'amor ||| amor ||| 0.396546 0.0833347 0.29741 0.08 ||| 0-0 0-1 ||| 3 4 2 ||| |||'
>>> re.match(single_word_regex, z, re.U)                                              
>>> re.match(single_word_regex, y, re.U)                                          
<_sre.SRE_Match object at 0x105b879c0>
>>> start = time.time(); re.match(single_word_regex, y, re.U); print time.time() - start
9.60826873779e-05
>>> start = time.time(); re.match(single_word_regex, x, re.U); print time.time() - start # It hangs...

Why is it taking that long?

Is there a better/simpler regex to capture this conditionlen(x.split(' ||| ')[0].split()) == 1 == len(x.split(' ||| ').split())?

Asked By: alvas

||

Answers:

Note that by itself, a r'(w+)+' pattern will not cause the catastrophic backtracking, it will only be “evil” inside a longer expression and especially when it is placed next to the start of the pattern since in case subsequent subpatterns fail the engine backtracks to this one, and as the 1+ quantifier inside is again quantified with +, that creates a huge amount of possible variations to try before failing. You may have a look at your regex demo and click the regex debugger on the left to see example regex engine behavior.

The current regex can be written as

r'^(w+)s|{3}s(w+)s|{3}s(.*)'

See the regex demo where there will be a match if you remove space and . in the second field.

Details:

  • ^ – start of string (not necessary with re.match)
  • (w+) – (Group 1) 1+ letters/digits/underscores
  • s – a whitespace
  • |{3} – 3 pipe symbols
  • s(w+)s|{3}s – see above ((w+) creates Group 2)
  • (.*) – (Group 3) any 0+ characters other than linebreak characters.
Answered By: Wiktor Stribiżew
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.