python regex speed

Question:

regarding regex (specifically python re), if we ignore the way the expression is written, is the length of the text the only factor for the time required to process the document? Or are there other factors (like how the text is structured) that play important roles too?

Asked By: goh


Answers:

Both the length of the text and its contents are important.

As an example, the regular expression a+b will fail quickly on a string containing one million bs (the very first character already rules out a match) but much more slowly on a string containing one million as. This is because far more backtracking is required in the second case.

import timeit

x = "re.search('a+b', s)"
print(timeit.timeit(x, "import re; s = 'a' * 10000", number=10))  # slow: backtracks
print(timeit.timeit(x, "import re; s = 'b' * 10000", number=10))  # fast: fails immediately

Results:

6.85791902323
0.00795443275612
Answered By: Mark Byers

One important consideration can also be whether the text actually matches the regular expression. Take (as a contrived example) the regex (x+x+)+y from this regex tutorial.

When applied to xxxxxxxxxxy it matches, taking the regex engine 7 steps. When applied to xxxxxxxxxx, it fails (of course), but it takes the engine 2558 steps to arrive at this conclusion.

For xxxxxxxxxxxxxxy vs. xxxxxxxxxxxxxx it’s already 7 vs 40958 steps, and so on exponentially…

This happens especially easily with nested repetitions or regexes where the same text can be matched by two or more different parts of the regex, forcing the engine to try all permutations before being able to declare failure. This is then called catastrophic backtracking.
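The effect is easy to reproduce, and the usual fix is to rewrite the pattern so that there is only one way for the engine to consume the same text. A minimal sketch, reusing the (x+x+)+y pattern from the answer above (the helper function and timings are illustrative, not from the original):

```python
import re
import time

def time_search(pattern, text):
    """Time a single re.search call; return (elapsed_seconds, match)."""
    start = time.perf_counter()
    m = re.search(pattern, text)
    return time.perf_counter() - start, m

text = "x" * 20  # no trailing 'y', so both patterns must fail

# Nested repetition: the engine tries exponentially many ways to
# split the x's between the two x+ groups before declaring failure.
slow, m1 = time_search(r"(x+x+)+y", text)

# Equivalent, unambiguous pattern: only one way to consume the x's,
# so failure is detected in linear time.
fast, m2 = time_search(r"x+y", text)

print(f"nested repetition: {slow:.4f}s  rewritten: {fast:.6f}s")
```

Both patterns match exactly the same strings (one or more x followed by y); only the failure cost differs, and the gap widens exponentially with each extra x.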

Answered By: Tim Pietzcker

Refactoring a regex into a multi-level trie accounts for roughly 95% of an 800% performance increase. The remaining 5% comes from factoring the pattern not only to enable the trie but to enhance it, which can yield up to a 30x performance boost.
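The answer above is terse, but one common form of this refactoring is replacing a flat alternation of words with a prefix-factored (trie-like) pattern, so the engine matches each shared prefix once instead of rescanning it for every alternative. A sketch with made-up example words (the word list and patterns are illustrative, not from the original answer):

```python
import re

# Flat alternation: the engine tries each branch in turn,
# re-reading the shared prefix "inter" for every alternative.
flat = re.compile(r"\b(?:interact|interface|interfere|internal|internet)\b")

# Trie-factored equivalent: common prefixes are matched exactly once,
# and the branch is chosen by the first differing character.
trie = re.compile(r"\binter(?:act|f(?:ace|ere)|n(?:al|et))\b")

text = "the internet interface may interfere with internal interaction"

# Both patterns accept exactly the same words.
print(flat.findall(text))
print(trie.findall(text))
```

The payoff grows with the number of alternatives: a flat alternation of thousands of words re-examines input characters once per branch, while the factored form examines each character a bounded number of times.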

Answered By: sln