Python regex performance

Question:

I want to scan source code to find hard-coded credentials.

I have this Python script:

import re
import sys
import time

def apply_regex(testn, regex, target):
    start = time.time()
    if re.findall(regex, target):
        print(f"Match {testn} sucessful in {time.time() - start} seconds.")

regex = sys.argv[1]
subtarget = sys.argv[2]
apply_regex(1, regex, subtarget)
apply_regex(2, regex, f'{"1234567890" * 2000 } {subtarget}')
apply_regex(3, regex, f'{"1234567890" * 4000 } {subtarget}')

The first run is:

python test.py 'password = ' 'password = "****"'
Match 1 sucessful in 0.0 seconds.
Match 2 sucessful in 0.0 seconds.
Match 3 sucessful in 0.0 seconds.

Now I want to find passwords with a prefix such as my-, so the second run is:

python test.py '[a-z0-9-]*password = ' 'my-password = "****"'
Match 1 sucessful in 0.001003265380859375 seconds.
Match 2 sucessful in 1.892561674118042 seconds.
Match 3 sucessful in 6.460033416748047 seconds.

The text in case 3 is twice as long as the text in case 2, but it takes about 3.4 times as long to scan.

How can I change the regex [a-z0-9-]*password = to improve performance?

** Update **

Based on the suggestions of @WiktorStribiżew and @Barmar, the third run is:

python test.py '\b[a-z0-9-]{1,10}password = ' 'my-password = "****"'
Match 1 sucessful in 0.001001596450805664 seconds.
Match 2 sucessful in 0.001996278762817383 seconds.
Match 3 sucessful in 0.0020165443420410156 seconds.

Asked By: Philippe

Answers:

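The problem here is backtracking. The regex engine tries to match at every position in the text, including the very end. Whenever the first subexpression matches at a position, the engine then has to work out whether the rest of the pattern can match from there. Since the first subpattern is [a-z0-9-]*, which can match the empty string, it succeeds at every position, so the engine has to attempt a full match at each one. On a long run of digits, each attempt consumes the rest of the run and then gives it back one character at a time while looking for password, which is why the time grows much faster than the length of the text.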
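To put rough numbers on that, here is a minimal sketch of a cost model. It is only an approximation of what a backtracking engine does, not a description of the internals of re, and the function name estimated_steps is made up for this illustration. It assumes that from each start position the character class consumes the rest of the digit run and then gives it back one character at a time.

```python
def estimated_steps(run_length):
    """Rough step count for scanning a run of digits that contains no match.

    Assumption (simplified model, not the real engine's accounting):
    from each start position the greedy class consumes the remaining
    characters, then backtracks over them one by one.
    """
    total = 0
    for start in range(run_length):
        remaining = run_length - start
        total += 2 * remaining  # consume, then give back
    return total


# Same digit-run lengths as cases 2 and 3 in the question.
short_run = estimated_steps(20_000)
long_run = estimated_steps(40_000)
ratio = long_run / short_run
print(f"{short_run} vs {long_run} estimated steps, ratio {ratio:.2f}")
```

Under this model, doubling the length roughly quadruples the work, which is in the same range as the slowdown measured in the question.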

The common trick here is to "tie" your regex to a specific context. For example, in many cases when you only need to find words, a word boundary, \b, is enough to speed up the search. Yes, \b[a-z0-9-]*password = will work faster, but it will still attempt a match at every word boundary location.
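The effect can be checked with a small comparison like the one below. This is a sketch rather than a benchmark: the helper timed_search is hypothetical, the input is deliberately smaller than in the question so the slow pattern finishes quickly, and the timings will vary by machine. The bounded {1,10} variant is the one from the question's update.

```python
import re
import time


# Hypothetical helper, not part of the original script: time one search.
def timed_search(pattern, text):
    start = time.perf_counter()
    match = re.search(pattern, text)
    return match, time.perf_counter() - start


# A long run of digits followed by the credential we want to find.
text = "1234567890" * 300 + ' my-password = "****"'

patterns = [
    r"[a-z0-9-]*password = ",          # unanchored, unbounded prefix
    r"\b[a-z0-9-]*password = ",        # prefix tied to a word boundary
    r"\b[a-z0-9-]{1,10}password = ",   # word boundary plus bounded prefix
]

results = {}
for pattern in patterns:
    match, elapsed = timed_search(pattern, text)
    results[pattern] = match.group(0) if match else None
    print(f"{pattern!r}: {results[pattern]!r} in {elapsed:.6f} s")
```

All three patterns find the same text here; only the amount of work needed to get there differs.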

The RE2 library can be used to avoid the issues common to backtracking NFA engines (Python's re and regex modules are both of this kind). Since RE2 does not backtrack, it will be faster than re in this scenario. Note that RE2 does not support many regex constructs available in backtracking engines, so only use it for "simple" regex searches (without lookarounds or backreferences) like the one in the question.
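A minimal sketch of what that could look like follows. It assumes a Python binding for RE2 is installed and importable as re2 with a re-style search function. The module name depends on which binding you install, so treat it as an assumption. If the import fails, the sketch falls back to the standard re module so it still runs.

```python
import re

try:
    # Assumption: an RE2 binding exposing a re-like API under the name "re2".
    import re2 as engine
    engine_name = "re2"
except ImportError:
    # Fallback so the example still runs without RE2 installed.
    engine = re
    engine_name = "re (fallback)"

# Kept small so the fallback engine also finishes quickly.
text = "1234567890" * 300 + ' my-password = "****"'
pattern = r"[a-z0-9-]*password = "  # the original pattern from the question

match = engine.search(pattern, text)
found = match.group(0) if match else None
print(f"{engine_name}: {found!r}")
```

Because the pattern uses only a character class, a quantifier and literal text, it stays within the subset that RE2 accepts.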

Answered By: Wiktor Stribiżew