Python re – force lazy quantifiers
Question:
Is there a simple way (one that doesn’t involve writing a custom regex parser) to force lazy quantifiers in a provided regex? By that I mean either replacing the default greedy quantifiers with their lazy versions, or changing the behavior of greedy (and possessive!) quantifiers so that they work like lazy ones. The regular expression is provided by the user, so it is only known at runtime (is it safe to run an untrusted regex?). I’ve looked into the re module and didn’t find any flag for that (why would it exist? This is a very specific use case).
Answers:
You can do this with sre_parse and sre_compile, private modules used by the re module. Their API is not public, so there’s no documentation and the API could change at any time. So, probably not for production use, but your bravery may vary.
The basic approach is to parse the regex into an abstract syntax tree (AST) using sre_parse.parse(), walk the tree and change all the greedy matches (MAX_REPEAT) to lazy (MIN_REPEAT), and compile the modified AST using sre_compile.compile(). I’ve tried to write it so it won’t break in future Pythons (this was written for 3.10.8), but who knows?
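To see what the tree walk has to look for, it can help to list the opcodes the parser produces for a pattern. The following is only a small sketch: top_level_ops is a made-up helper name, and the import fallback assumes that on Python 3.11+ the same private parser is reachable as re._parser (sre_parse is deprecated there).

```python
try:
    # Python 3.11+: the private parser lives inside the re package
    from re import _parser as sre_parse
except ImportError:
    # Python 3.10 and earlier
    import sre_parse


def top_level_ops(pattern):
    """Return the names of the top-level opcodes in the parsed pattern.

    Hypothetical helper for illustration; each item of the parsed
    tree is an (opcode, argument) pair.
    """
    return [repr(op) for op, _arg in sre_parse.parse(pattern)]


print(top_level_ops("<.+>"))   # greedy repeat shows up as MAX_REPEAT
print(top_level_ops("<.+?>"))  # lazy repeat shows up as MIN_REPEAT
```

The only difference between the two trees is the repeat opcode, which is why swapping MAX_REPEAT for MIN_REPEAT is enough to change the matching behavior.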
import re, sre_parse, sre_compile

def ungreedify(ast):
    """given a regex AST, change every greedy repeat to lazy"""
    for i, x in enumerate(ast):
        if isinstance(x, tuple) and x[0] == sre_parse.MAX_REPEAT:
            ast[i] = (sre_parse.MIN_REPEAT,) + x[1:]
        try:
            ungreedify(x)  # raises TypeError for scalars
        except TypeError:
            pass
    return ast

def compile_ungreedy(pattern):
    """ungreedify a regex string, returning compiled regex object"""
    return sre_compile.compile(ungreedify(sre_parse.parse(pattern)))
Usage:
TAG = "<.+>" # the first HTML tag regex we all try
TEXT = "<p>this is a test</p>"
greedytag = re.compile(TAG)
lazytag = compile_ungreedy(TAG)
# prints list with one item because TAG matches whole string
print(greedytag.findall(TEXT))
# prints list with two items because TAG matches each tag
print(lazytag.findall(TEXT))
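On newer Pythons the same idea needs two adjustments: since 3.11 sre_parse and sre_compile are deprecated in favor of re._parser and re._compiler (still private), and 3.11 also added possessive quantifiers, which parse to POSSESSIVE_REPEAT. Below is a hedged sketch of a variant that tries to cope with both. The names make_lazy and compile_lazy are made up for this example, and the whole thing still relies on undocumented internals, so treat it as an illustration rather than a supported recipe.

```python
import re

try:
    # Python 3.11+: private modules live inside the re package
    from re import _parser as sre_parse, _compiler as sre_compile
except ImportError:
    # Python 3.10 and earlier
    import sre_parse
    import sre_compile

# Repeat opcodes that should become lazy. POSSESSIVE_REPEAT only
# exists on Python 3.11+, so it is looked up defensively.
GREEDY_OPS = [sre_parse.MAX_REPEAT]
if hasattr(sre_parse, "POSSESSIVE_REPEAT"):
    GREEDY_OPS.append(sre_parse.POSSESSIVE_REPEAT)


def make_lazy(node):
    """Rewrite greedy (and possessive) repeats to lazy ones, in place."""
    if isinstance(node, sre_parse.SubPattern):
        # Items of a SubPattern are (opcode, argument) pairs.
        for i, (op, av) in enumerate(node):
            if op in GREEDY_OPS:
                node[i] = (sre_parse.MIN_REPEAT, av)
            make_lazy(av)  # the argument may hold nested subpatterns
    elif isinstance(node, (list, tuple)):
        # Containers such as repeat bounds, branches, and character sets:
        # descend into them, but never rewrite their elements.
        for child in node:
            make_lazy(child)
    return node


def compile_lazy(pattern, flags=0):
    """Parse pattern, force lazy repeats, and return a compiled regex."""
    flags = int(flags)  # accept re.RegexFlag values as plain ints
    tree = make_lazy(sre_parse.parse(pattern, flags))
    return sre_compile.compile(tree, flags)


print(compile_lazy("<.+>").findall("<p>this is a test</p>"))
```

Opcodes are only compared for items that sit directly in a SubPattern, so numbers that appear elsewhere in the tree (repeat bounds, character ranges) cannot be mistaken for a repeat opcode.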