Why isn't the regular expression's "non-capturing" group working?

Question

In the snippet below, I expected the non-capturing group "(?:aaa)" to be left out of the match result.

The result should be "_bbb" only.

However, I get "aaa_bbb" in the matching result; only when I specify group(2) does it show "_bbb".

>>> import re
>>> s = "aaa_bbb"
>>> print(re.match(r"(?:aaa)(_bbb)", s).group())

aaa_bbb

Asked By: Jim Horng

Answer 1

Try:

print(re.match(r"(?:aaa)(_bbb)", s).group(1))

group() is the same as group(0). Group 0 is always present, and it is the whole regex match.

Answered By: codaddict

Answer 2

From the manual:

class re.MatchObject

group([group1, ...])

Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string.
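
As a small illustration of the quoted behavior, here is a sketch using the pattern from the question. The variable names (whole, first, both) are only for this example.

```python
import re

m = re.match(r"(?:aaa)(_bbb)", "aaa_bbb")

whole = m.group()      # no argument: same as group(0), the entire match
first = m.group(1)     # one argument: a single string
both = m.group(0, 1)   # several arguments: a tuple, one item per argument

print(whole)  # aaa_bbb
print(first)  # _bbb
print(both)   # ('aaa_bbb', '_bbb')
```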

Answered By: Pavel Minaev

Answer 3

You have to specify group(1) to get just the part captured by the parentheses (_bbb in this case).

group() without parameters returns the whole string that the complete regular expression matched, regardless of whether parts of it were additionally captured by parentheses.

Answered By: sth

Answer 4

group() and group(0) will return the entire match. Subsequent groups are actual capture groups.

>>> print(re.match(r"(?:aaa)(_bbb)", s).group(0))
aaa_bbb
>>> print(re.match(r"(?:aaa)(_bbb)", s).group(1))
_bbb
>>> print(re.match(r"(?:aaa)(_bbb)", s).group(2))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: no such group

If you want a single string built only from the captured groups (what you expected group() to return):

" ".join(re.match(r"(?:aaa)(_bbb)", s).groups())

Answered By: Richard Simões

Answer 5

I think you’re misunderstanding the concept of a “non-capturing group”. The text matched by a non-capturing group still becomes part of the overall regex match.

Both the regex (?:aaa)(_bbb) and the regex (aaa)(_bbb) return aaa_bbb as the overall match. The difference is that the first regex has one capturing group which returns _bbb as its match, while the second regex has two capturing groups that return aaa and _bbb as their respective matches. In your Python code, to get _bbb, you’d need to use group(1) with the first regex, and group(2) with the second regex.
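
The comparison above can be checked directly. This is a minimal sketch; the names non_capturing and capturing are made up for the example.

```python
import re

s = "aaa_bbb"
non_capturing = re.match(r"(?:aaa)(_bbb)", s)
capturing = re.match(r"(aaa)(_bbb)", s)

# The overall match is identical for both patterns.
print(non_capturing.group())   # aaa_bbb
print(capturing.group())       # aaa_bbb

# Only the number of capturing groups differs.
print(non_capturing.groups())  # ('_bbb',)
print(capturing.groups())      # ('aaa', '_bbb')

# So "_bbb" is group 1 in the first pattern and group 2 in the second.
print(non_capturing.group(1))  # _bbb
print(capturing.group(2))      # _bbb
```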

The main benefit of non-capturing groups is that you can add them to a regex without upsetting the numbering of the capturing groups in the regex. They also offer (slightly) better performance as the regex engine doesn’t have to keep track of the text matched by non-capturing groups.

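If you really want to exclude aaa from the overall regex match then you need to use lookaround. In this case, positive lookbehind does the trick: (?<=aaa)_bbb. With this regex, group() returns _bbb in Python, provided you use re.search() rather than re.match(), because re.match() only tries to match at the start of the string, where nothing precedes the match position. No capturing groups needed.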
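
A short sketch of the lookbehind approach, using the string from the question. Note that it uses re.search(): re.match() anchors the attempt at position 0, where the lookbehind cannot find a preceding aaa.

```python
import re

s = "aaa_bbb"
pattern = r"(?<=aaa)_bbb"

# search() scans the string; the lookbehind checks for "aaa" without consuming it.
m = re.search(pattern, s)
print(m.group())  # _bbb
print(m.start())  # 3

# match() only tries position 0, where no "aaa" precedes, so it finds nothing.
anchored = re.match(pattern, s)
print(anchored)   # None
```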

My recommendation is that if you have the ability to use capturing groups to get part of the regex match, use that method instead of lookaround.

Answered By: Jan Goyvaerts

Answer 6

Use the groups() method on the match object instead of group(). It returns a tuple of all captured groups. The group() method with no argument returns the entire match of the regular expression.

Answered By: Matt

Answer 7

To go along with Jan Goyvaerts’s answer:

The main benefit of non-capturing groups is that you can add them to a regex without upsetting the numbering of the capturing groups in the regex.

Why add a group that doesn’t upset the numbering of the groups?

to organize your expression and make it easier to read
to allow for defining characteristics that you ultimately don’t want captured
to give precedence to other operators
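
As a rough sketch of the precedence point: a non-capturing group lets a quantifier or an alternation apply to a whole sub-pattern without adding a numbered group. The sample strings and variable names here are invented for illustration.

```python
import re

# Without a group, + applies only to the final "b", so "ababab" does not match.
ungrouped = re.fullmatch(r"ab+", "ababab")
print(ungrouped)          # None

# (?:ab)+ repeats "ab" as a unit; the only numbered group is still (_\d+).
repeated = re.fullmatch(r"(?:ab)+(_\d+)", "ababab_42")
print(repeated.group(1))  # _42

# (?:...|...) limits the scope of the alternation without capturing the label.
labelled = re.match(r"(?:real|user|sys)=(\d+)", "user=7")
print(labelled.group(1))  # 7
```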

OP’s example, r"(?:aaa)(_bbb)", draws my eye (and brain) to "aaa" but there’s nothing special about it… to me, it merely seems adjacent to what really matters, "_bbb".

I think it should actually be one of either:

r"aaa(_bbb)": if "aaa" isn’t important to how you/we read the regex, "aaa" has no meaning to us
r"(?:aaa(_bbb))": if "aaa_bbb" is a single thing we should (mentally) consider as a whole, but in the end we only use "_bbb"

All three are equivalent from the perspective of the regex engine; they match the same text and produce the same capture groups:

import re

s = "aaa_bbb"
for pattern in [r"(?:aaa)(_bbb)", r"aaa(_bbb)", r"(?:aaa(_bbb))"]:
    m = re.match(pattern, s)

    print(pattern)
    print("=" * len(pattern))
    print(f"groups(): {m.groups()}")
    for i in range(1, len(m.groups()) + 1):
        print(f"  group({i}): {m.group(i)}")
    print()

(?:aaa)(_bbb)
=============
groups(): ('_bbb',)
  group(1): _bbb

aaa(_bbb)
=========
groups(): ('_bbb',)
  group(1): _bbb

(?:aaa(_bbb))
=============
groups(): ('_bbb',)
  group(1): _bbb

For a real-world example, I want to parse the output of a Unix-like time command:

% /usr/bin/time -l foo
...
        9.87 real         6.54 user         3.21 sys
...

My original Python regex was:

import re

line = "        9.87 real         6.54 user         3.21 sys"

pattern = r"\s+(\d+\.\d+) real\s+(\d+\.\d+) user\s+(\d+\.\d+) sys"

which gives me the correct groupings:

m = re.match(pattern, line)

print(m.groups())  # ('9.87', '6.54', '3.21')
print(m.group(1))  # 9.87
print(m.group(2))  # 6.54
print(m.group(3))  # 3.21

But the labels at the end of the value kept throwing my eyes off, and I kept seeing "real\s+(\d+\.\d+)" instead of "(\d+\.\d+) real".

I could try to add parens to visually distinguish the "meta groups", which is a little better, visually, for me:

pattern = r"\s+((\d+\.\d+) real)\s+((\d+\.\d+) user)\s+((\d+\.\d+) sys)"

but that breaks the logic of the groupings I want:

m = re.match(pattern, line)

print(m.groups())  # ('9.87 real', '9.87', '6.54 user', '6.54', '3.21 sys', '3.21')
print(m.group(1))  # 9.87 real
print(m.group(2))  # 9.87
print(m.group(3))  # 6.54 user
print(m.group(4))  # 6.54
print(m.group(5))  # 3.21 sys
print(m.group(6))  # 3.21

Making those meta groups non-capturing gives me a better visual (my eyes still do some back-and-forth, but at least there are clear boundaries) for my meta groups:

pattern = r"\s+(?:(\d+\.\d+) real)\s+(?:(\d+\.\d+) user)\s+(?:(\d+\.\d+) sys)"

and it leaves the logic of the grouping alone:

m = re.match(pattern, line)

print(m.groups())  # ('9.87', '6.54', '3.21')
print(m.group(1))  # 9.87
print(m.group(2))  # 6.54
print(m.group(3))  # 3.21

Answered By: Zach Young
