Python re – force lazy quantifiers

Question:

Is there a simple way (one that doesn’t involve writing a custom regex parser) to force lazy quantifiers in a provided regex? By that I mean either replacing the default greedy quantifiers with their lazy versions, or changing the behavior of greedy (and possessive!) quantifiers so that they work like lazy ones. The regular expression is provided by the user, so it is only known at runtime (is it safe to run an untrusted regex?). I’ve looked into the re module and didn’t find a flag for that (why would one exist? This is a very specific use case).

Asked By: oBrstisf8o


Answers:

You can do this with sre_parse and sre_compile, private modules used by the re module. Their API is not public, so there’s no documentation and the API could change at any time. (Since Python 3.11 both modules are deprecated and the code lives in the equally private re._parser and re._compiler.) So, probably not for production use, but your bravery may vary.

The basic approach is to parse the regex into an abstract syntax tree (AST) using sre_parse.parse(), walk the tree and change all the greedy matches (MAX_REPEAT) to lazy (MIN_REPEAT), and compile the modified AST using sre_compile.compile(). I’ve tried to write it so it won’t break in future Pythons (this was written for 3.10.8), but who knows?
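To see what the walker has to find, it can help to look at the parsed tree first. The snippet below is only a sketch: top_level_ops is a made-up helper name, and the import fallback assumes the private modules are laid out as in Python 3.10 and 3.11+.

```python
try:
    # Python 3.11+: the private parser lives inside the re package
    from re import _parser as sre_parse
except ImportError:
    # Python 3.10 and earlier: undocumented top-level module
    import sre_parse


def top_level_ops(pattern):
    """Return the opcode of each top-level node in the parsed pattern.

    Hypothetical helper for illustration; each node is an (opcode, argument) pair.
    """
    return [op for op, _ in sre_parse.parse(pattern).data]


greedy_ops = top_level_ops("<.+>")    # the quantifier node is a MAX_REPEAT
lazy_ops = top_level_ops("<.+?>")     # same shape, but the node is a MIN_REPEAT

print(greedy_ops)
print(lazy_ops)
```

The two trees differ only in the opcode of the repeat node, which is why swapping MAX_REPEAT for MIN_REPEAT is enough. For a fuller view, sre_parse.parse(pattern).dump() prints the whole tree with indentation.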

import re, sre_parse, sre_compile

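def ungreedify(ast):
    """given a regex AST, change every greedy repeat to lazy (in place)"""
    for i, x in enumerate(ast):
        # compare by identity: opcodes are int subclasses, so == would also
        # match a plain int (e.g. a repeat count) that has the same value
        if isinstance(x, tuple) and x[0] is sre_parse.MAX_REPEAT:
            ast[i] = (sre_parse.MIN_REPEAT,) + x[1:]
        try:
            ungreedify(x)   # recurse into nested nodes; throws error for scalars
        except TypeError:
            pass
    return ast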

def compile_ungreedy(pattern):
    """ungreedify a regex string, returning compiled regex object"""
    return sre_compile.compile(ungreedify(sre_parse.parse(pattern)))

Usage:

TAG = "<.+>"    # the first HTML tag regex we all try
TEXT = "<p>this is a test</p>"

greedytag = re.compile(TAG)
lazytag = compile_ungreedy(TAG)

# prints list with one item because TAG matches whole string
print(greedytag.findall(TEXT))

# prints list with two items because TAG matches each tag
print(lazytag.findall(TEXT))
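
The question also mentions possessive quantifiers, which Python's re only supports from 3.11 on, where the parser emits a POSSESSIVE_REPEAT node for them. Below is a rough variant that rewrites those too and tries the newer private module locations first. Treat it as a sketch under assumptions: make_lazy, compile_lazy and the flags parameter are names made up here, and it leans on the same private, unstable API as above.

```python
import re

try:
    # Python 3.11+: private parser/compiler live inside the re package
    from re import _parser as sre_parse, _compiler as sre_compile
except ImportError:
    # Python 3.10 and earlier: undocumented top-level modules
    import sre_parse
    import sre_compile

# Opcodes to rewrite as lazy. POSSESSIVE_REPEAT only exists on 3.11+.
_TO_LAZY = [sre_parse.MAX_REPEAT]
if hasattr(sre_parse, "POSSESSIVE_REPEAT"):
    _TO_LAZY.append(sre_parse.POSSESSIVE_REPEAT)


def make_lazy(node):
    """Rewrite greedy/possessive repeats under `node` to lazy ones, in place."""
    if isinstance(node, sre_parse.SubPattern):
        # A SubPattern holds a list of (opcode, argument) pairs.
        for i, (op, av) in enumerate(node.data):
            if any(op is target for target in _TO_LAZY):
                node.data[i] = (sre_parse.MIN_REPEAT, av)
            make_lazy(av)  # the argument may contain nested SubPatterns
    elif isinstance(node, (list, tuple)):
        # Arguments of groups, branches, etc. are plain containers.
        for child in node:
            make_lazy(child)
    return node


def compile_lazy(pattern, flags=0):
    """Parse `pattern`, make every repeat lazy, return a compiled regex."""
    flags = int(flags)  # accept re.I and friends as well as plain ints
    tree = sre_parse.parse(pattern, flags)
    return sre_compile.compile(make_lazy(tree), flags)


print(compile_lazy("<.+>").findall("<p>this is a test</p>"))
```

Only nodes that sit directly in a SubPattern are checked, so repeat counts and group numbers are never mistaken for opcodes. Atomic groups such as (?>...) are left untouched, so a pattern that uses them can still refuse to backtrack.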
Answered By: kindall