Regular Expressions in Python unexpectedly slow

Question:

Consider this Python code:

import timeit
import re

def one():
        any(s in mystring for s in ('foo', 'bar', 'hello'))

r = re.compile('(foo|bar|hello)')
def two():
        r.search(mystring)


mystring="hello"*1000
print([timeit.timeit(k, number=10000) for k in (one, two)])
mystring="goodbye"*1000
print([timeit.timeit(k, number=10000) for k in (one, two)])

Basically, I’m benchmarking two ways to check whether any of several substrings occurs in a large string.

What I get here (Python 3.2.3) is this output:

[0.36678314208984375, 0.03450202941894531]
[0.6672089099884033, 3.7519450187683105]

In the first case, the regular expression easily defeats the any expression – the regular expression finds the substring immediately, while the any has to scan the whole string twice (once for 'foo' and once for 'bar') before it gets to the matching substring.

But what’s going on in the second example? In the case where the substring isn’t present, the regular expression is surprisingly slow! This surprises me, since theoretically the regex only has to go over the string once, while the any expression has to go over the string three times. What’s wrong here? Is there a problem with my regex, or are Python regexes simply slow in this case?

Asked By: cha0site


Answers:

The reason the regex is so slow is that it not only has to go through the whole string, it also has to perform several comparisons at every character.

The first one simply does this:

Does f match h? No.
Does b match h? No.
Does h match h? Yes.
Does e match e? Yes.
Does l match l? Yes.
Does l match l? Yes.
Does o match o? Yes.
Done. Match found.
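The first walkthrough can be verified directly: on the 'hello' string the compiled pattern succeeds at the very first position, so the engine never has to scan further (a minimal sketch):

```python
import re

r = re.compile('(foo|bar|hello)')
m = r.search('hello' * 1000)
# The engine succeeds immediately at index 0, so no scanning is needed.
print(m.start(), m.group())  # prints: 0 hello
```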

The second one does this:

Does f match g? No.
Does b match g? No.
Does h match g? No.
Does f match o? No.
Does b match o? No.
Does h match o? No.
Does f match o? No.
Does b match o? No.
Does h match o? No.
Does f match d? No.
Does b match d? No.
Does h match d? No.
Does f match b? No.
Does b match b? Yes.
Does a match y? No.
Does h match b? No.
Does f match y? No.
Does b match y? No.
Does h match y? No.
Does f match e? No.
Does b match e? No.
Does h match e? No.
... 999 more times ...
Done. No match found.

I can only speculate about the difference between the any and the regex, but I’m guessing the regex is slower mostly because it runs in a highly complex, general-purpose engine, and with all the state-machine machinery involved, it just isn’t as efficient as the specialized implementation behind the in operator.

In the first string, the regex will find a match almost instantaneously, while any has to loop through the string twice before finding anything.

In the second string, however, the any performs essentially the same steps as the regex, but in a different order. This suggests that the any solution is faster, probably because it is simpler.

Specific code is more efficient than generic code. Any knowledge about the problem can be put to use in optimizing the solution. Simple code is preferred over complex code. Essentially, the regex is faster when the pattern will be near the start of the string, but in is faster when the pattern is near the end of the string, or not found at all.
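That rule of thumb can be checked with a small benchmark along the lines of the question’s code (a sketch; the helper names are mine and absolute timings will vary by machine):

```python
import re
import timeit

r = re.compile('(foo|bar|hello)')
needles = ('foo', 'bar', 'hello')

def with_in(haystack):
    # the "specific" solution: plain substring search
    return any(s in haystack for s in needles)

def with_regex(haystack):
    # the "generic" solution: one alternation pattern
    return r.search(haystack) is not None

hit = 'hello' * 1000     # a needle appears right at the start
miss = 'goodbye' * 1000  # no needle appears at all

for label, haystack in (('hit', hit), ('miss', miss)):
    t_in = timeit.timeit(lambda: with_in(haystack), number=1000)
    t_re = timeit.timeit(lambda: with_regex(haystack), number=1000)
    print(f'{label}: in={t_in:.4f}s regex={t_re:.4f}s')
```

On the "hit" string the regex tends to win; on the "miss" string, in wins.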

Disclaimer: I don’t know Python. I know algorithms.

Answered By: Kendall Frey

You have a regexp that is made up of three alternatives. Exactly how do you think that could work without checking all three? 🙂 There’s no magic in computing; you still have to do three checks.

But the regexp does its three tests character by character, while the “one()” method checks the whole string for one substring before moving on to the next.

The reason the regexp is much faster in the first case is that you check last for the string that actually matches. That means one() first has to look through the whole string for “foo”, then for “bar”, and only then for “hello”, where it matches. Move “hello” first, and one() and two() are almost the same speed, since the first check succeeds in both cases.
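Reordering the tuple so the matching substring comes first makes that easy to confirm (a sketch; the function names are mine, and absolute timings will vary):

```python
import timeit

mystring = 'hello' * 1000

def match_last():
    # 'hello' is checked last: 'foo' and 'bar' each force a full scan first
    return any(s in mystring for s in ('foo', 'bar', 'hello'))

def match_first():
    # 'hello' is checked first: the very first probe succeeds
    return any(s in mystring for s in ('hello', 'foo', 'bar'))

print(timeit.timeit(match_last, number=10000))
print(timeit.timeit(match_first, number=10000))  # much smaller number
```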

Regexps are much more complex tests than “in”, so I’d expect them to be slower. I suspect the complexity increases a lot when you use “|”, but I haven’t read the source for the regexp library, so what do I know. 🙂

Answered By: Lennart Regebro

Note to future readers

I think the correct answer is actually that Python’s string handling algorithms are really optimized for this case, and the re module is actually a bit slower. What I’ve written below is true, but is probably not relevant to the simple regexps I have in the question.

Original Answer

Apparently this is not a random fluke – Python’s re module really is slower. It looks like it uses a recursive backtracking approach when it fails to find a match, as opposed to building a DFA and simulating it.

It uses the backtracking approach even when there are no back references in the regular expression!

What this means is that in the worst case, Python regexes take exponential, not linear, time!
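The exponential worst case is easy to trigger with nested quantifiers. The pattern below can never match, so the backtracking engine has to explore every way of partitioning the run of 'a's before giving up (a sketch; keep n small, since the work roughly doubles with each extra character):

```python
import re
import time

# Classic catastrophic-backtracking pattern: nested quantifiers
# followed by a character that never appears in the input.
pattern = re.compile(r'(a+)+b')

for n in (16, 18, 20):
    s = 'a' * n  # no 'b', so every match attempt must fail
    start = time.perf_counter()
    result = pattern.search(s)  # always None...
    print(n, result, f'{time.perf_counter() - start:.4f}s')  # ...but ever slower
```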

This is a very detailed paper describing the issue:
http://swtch.com/~rsc/regexp/regexp1.html

I think this graph near the end summarizes it succinctly:
[Graph: performance of various regular expression implementations, time vs. string length]

Answered By: cha0site

My coworker found the re2 library (https://code.google.com/p/re2/), which has a Python wrapper. It’s a bit tricky to get installed on some systems.

I was having the same issue with some complex regexes and long strings – re2 sped the processing time up significantly, from seconds to milliseconds.
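Since the wrapper is advertised as a drop-in replacement for re, switching engines can be a one-line change (a sketch; the `re2` import name is an assumption, so this falls back to the stdlib if the wrapper isn’t installed):

```python
# pyre2 mirrors the stdlib re API, so it can be swapped in directly
# (assumption: the wrapper is installed as the 're2' module).
try:
    import re2 as re_engine  # RE2: linear-time matching, no backtracking
except ImportError:
    import re as re_engine   # fall back to the stdlib backtracking engine

r = re_engine.compile('(foo|bar|hello)')
print(r.search('goodbye' * 1000))  # no match: prints None
```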

Answered By: Annie B