Python non-greedy regexes
Question:
How do I make a Python regex like "(.*)" such that, given "a (b) c (d) e", Python matches "b" instead of "b) c (d"?
I know that I can use "[^)]" instead of ".", but I’m looking for a more general solution that keeps my regex a little cleaner. Is there any way to tell Python “hey, match this as soon as possible”?
Answers:
You seek the all-powerful *?
From the docs, Greedy versus Non-Greedy:
the non-greedy qualifiers *?, +?, ??, or {m,n}? […] match as little text as possible.
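For what it's worth, here is a small sketch of the four lazy forms next to their greedy counterparts. The sample string "aaaa" and the variable names are made up for illustration and are not taken from the docs.

```python
import re

text = "aaaa"

# Greedy quantifiers take as much as they can.
greedy = {p: re.match(p, text).group() for p in (r"a*", r"a+", r"a?", r"a{2,3}")}

# The same quantifiers with a trailing "?" take as little as they can.
lazy = {p: re.match(p, text).group() for p in (r"a*?", r"a+?", r"a??", r"a{2,3}?")}

print(greedy)  # {'a*': 'aaaa', 'a+': 'aaaa', 'a?': 'a', 'a{2,3}': 'aaa'}
print(lazy)    # {'a*?': '', 'a+?': 'a', 'a??': '', 'a{2,3}?': 'aa'}
```

Note that the lazy *? and ?? happily match the empty string when nothing after them forces them to consume more.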
Would not \(.*?\) work? That is the non-greedy syntax.
>>> import re
>>> x = "a (b) c (d) e"
>>> re.search(r"\(.*\)", x).group()
'(b) c (d)'
>>> re.search(r"\(.*?\)", x).group()
'(b)'
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behavior isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
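The <H1> case from that quote can be checked directly. This is only a quick sketch; the variable names are mine.

```python
import re

html = "<H1>title</H1>"

# Greedy: .* runs to the end of the string, then backs off to the last ">".
greedy_match = re.match(r"<.*>", html).group()

# Non-greedy: .*? stops at the first ">" it can.
lazy_match = re.match(r"<.*?>", html).group()

# With findall, the lazy pattern picks up each tag separately.
all_tags = re.findall(r"<.*?>", html)

print(greedy_match)  # <H1>title</H1>
print(lazy_match)    # <H1>
print(all_tags)      # ['<H1>', '</H1>']
```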
Do you want it to match “(b)”? Do as Zitrax and Paolo have suggested. Do you want it to match “b”? Do
>>> import re
>>> x = "a (b) c (d) e"
>>> re.search(r"\((.*?)\)", x).group(1)
'b'
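If you want every parenthesized chunk rather than just the first, the same pattern should work with re.findall, which returns the contents of the capture group for each match. A small sketch using the question's input:

```python
import re

x = "a (b) c (d) e"

# The escaped parentheses match literally; the inner group captures lazily.
first = re.search(r"\((.*?)\)", x).group(1)
every = re.findall(r"\((.*?)\)", x)

print(first)  # b
print(every)  # ['b', 'd']
```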
Using an ungreedy match is a good start, but I’d also suggest that you reconsider any use of .* at all. What about this?
groups = re.search(r"\([^)]*\)", x)
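One practical difference between the two approaches, as far as I know: "." does not match a newline unless you pass re.DOTALL, while a negated class like [^)] does. A sketch with a made-up multi-line input:

```python
import re

multiline = "a (b\nc) d"  # hypothetical input, not from the question

# Lazy dot without DOTALL cannot cross the newline, so nothing matches.
lazy_dot = re.search(r"\((.*?)\)", multiline)

# With DOTALL the dot matches newlines too.
lazy_dotall = re.search(r"\((.*?)\)", multiline, re.DOTALL)

# The negated class matches anything except ")", newline included.
negated = re.search(r"\(([^)]*)\)", multiline)

print(lazy_dot)                    # None
print(repr(lazy_dotall.group(1)))  # 'b\nc'
print(repr(negated.group(1)))      # 'b\nc'
```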
As the others have said, using the ? modifier on the * quantifier will solve your immediate problem, but be careful: you are starting to stray into areas where regexes stop working and you need a parser instead. For instance, the string “(foo (bar)) baz” will cause you problems.
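To make the nested case concrete, here is a sketch comparing the lazy regex with a tiny depth-counting scanner. The function outermost_groups is a hypothetical helper written for this example, not something from the re module, and it is nowhere near a full parser.

```python
import re

def outermost_groups(text):
    """Return the contents of each top-level (...) group using a depth counter."""
    groups, depth, start = [], 0, None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i + 1  # remember where the outermost group begins
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                groups.append(text[start:i])
    return groups

nested = "(foo (bar)) baz"

# The lazy regex stops at the first ")", cutting the outer group short.
lazy_result = re.search(r"\((.*?)\)", nested).group(1)

# The scanner tracks nesting depth and returns the whole outer group.
scanner_result = outermost_groups(nested)

print(lazy_result)     # foo (bar
print(scanner_result)  # ['foo (bar)']
```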
To start with, I do not suggest using “*” in regexes. Yes, I know, it is the most used repetition quantifier, but it is nevertheless a bad idea. This is because, while it does match any amount of repetition for that character, “any” includes 0, which is usually something you want to treat as an error, not accept. Instead, I suggest using the + sign, which matches one or more repetitions. What’s more, from what I can see, you are dealing with fixed-length parenthesized expressions. As a result, you can probably use the {m,n} syntax (with no space after the comma) to specify the desired length.
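A rough sketch of how those three quantifiers differ on parenthesized input. The test strings "()" and "(long)" and the bounds 1 and 3 are arbitrary choices for illustration.

```python
import re

# "*" accepts zero characters between the parentheses; "+" insists on at least one.
star = re.search(r"\(([^)]*)\)", "()")
plus = re.search(r"\(([^)]+)\)", "()")

# {m,n} bounds the length; note there is no space after the comma.
short = re.search(r"\(([^)]{1,3})\)", "(b)")
too_long = re.search(r"\(([^)]{1,3})\)", "(long)")

print(repr(star.group(1)))  # ''
print(plus)                 # None
print(short.group(1))       # b
print(too_long)             # None
```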
However, if you really do need non-greedy repetition, I suggest consulting the all-powerful ?. This, when placed after any regex repetition specifier, will force that part of the regex to match the least amount of text possible.
That being said, I would be very careful with the ? as it, like the Sonic Screwdriver in Doctor Who, has a tendency to do, how should I put it, “slightly” undesired things if not carefully calibrated. For example, to use your example input, it would identify ((1) (note the lack of a second rparen) as a match.
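That behaviour is easy to reproduce. A lazy quantifier only minimizes how far the match extends to the right; the search still starts at the leftmost "(" it can. In this sketch the input string "((1)" is assumed for illustration.

```python
import re

unbalanced = "((1)"

# The match starts at the first "(" and runs to the first ")" after it,
# so the inner "(" ends up inside the captured text.
m = re.search(r"\((.*?)\)", unbalanced)

print(m.group())   # ((1)
print(m.group(1))  # (1
```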