Why doesn't \0 work in Python regexp substitutions, i.e. with sub() or expand(), while match.group(0) does, and also \1, \2, …?

Question:

Why doesn’t \0 work (i.e. return the full match) in Python regexp substitutions, i.e. with sub() or match.expand(), while match.group(0) does, and also \1, \2, …?

This simple example (executed in Python 3.7) says it all:

import re

subject = '123'
regexp_pattern = r'\d(2)\d'
expand_template_full = r'\0'
expand_template_group = r'\1'

regexp_obj = re.compile(regexp_pattern)

match = regexp_obj.search(subject)
if match:
    print('Full match, by method: {}'.format(match.group(0)))
    print('Full match, by template: {}'.format(match.expand(expand_template_full)))
    print('Capture group 1, by method: {}'.format(match.group(1)))
    print('Capture group 1, by template: {}'.format(match.expand(expand_template_group)))

The output from this is:

Full match, by method: 123
Full match, by template: 
Capture group 1, by method: 2
Capture group 1, by template: 2

Is there any other sequence I can use in the replacement/expansion template to get the full match? If not, for the love of god, why?

Is this a Python bug?

Asked By: QuestionOverflow


Answers:

If you look in the docs, you will find the following:

The backreference \g<0> substitutes in the entire substring matched by the RE.

A bit deeper in the docs (as far back as 2003) you will find this tip:

There is a group 0, which is the entire matched pattern, but it can’t be referenced with \0; instead, use \g<0>.

So you need to follow this recommendation and use \g<0>:

expand_template_full = r'\g<0>'
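
As a minimal sketch of the difference (the variable names below are only illustrative, and the sample string 'a123b' is not from the question), this compares the two templates and shows that the same \g<0> syntax also works in sub(). It assumes, as the question's output suggests, that \0 in a template is read as an octal escape for the NUL character rather than as a group reference:

```python
import re

pattern = re.compile(r'\d(2)\d')
match = pattern.search('123')

# \g<0> expands to the whole match.
full_by_template = match.expand(r'\g<0>')

# \0 is parsed as an octal escape, so it yields the NUL character,
# which is invisible when printed without repr().
nul_by_template = match.expand(r'\0')

# The same template syntax is accepted by sub().
wrapped = pattern.sub(r'[\g<0>]', 'a123b')

print(repr(full_by_template))  # '123'
print(repr(nul_by_template))   # '\x00'
print(wrapped)                 # a[123]b
```
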
Answered By: Olvin Roght

Huh, you’re right, that is annoying!

Fortunately, Python’s way ahead of you. The docs for sub say this:

In string-type repl arguments, in addition to the character escapes and backreferences described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number…. The backreference \g<0> substitutes in the entire substring matched by the RE.

So your code example can be:

import re

subject = '123'
regexp_pattern = r'\d(2)\d'
expand_template_full = r'\g<0>'

regexp_obj = re.compile(regexp_pattern)

match = regexp_obj.search(subject)
if match:
    print('Full match, by template: {}'.format(match.expand(expand_template_full)))

You also asked the far more interesting question of “why?”. The rationale in the docs explains that \g<number> lets you reference a group unambiguously, because it’s not clear whether \10 should be substituted with the 10th group or with the first capture group followed by a zero, but it doesn’t explain why \0 doesn’t work. I’ve not been able to find a PEP explaining the rationale, but here’s my guess:
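
To make the \10 ambiguity concrete, here is a small sketch. The pattern and input string are made up for illustration; the point is only that \g<1>0 means "group 1, then a literal 0", while \10 is taken as a reference to group 10, which I'd expect to fail when the pattern has just one group:

```python
import re

text = 'x7'
pattern = r'x(\d)'

# \g<1>0 is unambiguous: group 1 followed by a literal '0'.
unambiguous = re.sub(pattern, r'\g<1>0', text)
print(unambiguous)  # 70

# \10 is read as a reference to group 10, which this pattern lacks,
# so the template is rejected.
try:
    re.sub(pattern, r'\10', text)
    raised = False
except re.error:
    raised = True
print(raised)  # True
```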

We want the repl argument to re.sub to use the same capture group backreferencing syntax as in regex matching. When regex matching, the concept of “backreferencing” to the entire matched string is nonsensical; the hypothetical regex r'A\0' would match an infinitely long string of A characters and nothing else. So we cannot allow \0 to exist as a backreference. If you can’t match with a backreference that looks like that, you shouldn’t be able to replace with it either.
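
For what it's worth, here is a quick check of what r'A\0' actually does as a pattern. This is only a sketch of observed behaviour, not a statement about the designers' intent: \0 on the matching side is treated as an octal escape for the NUL character, so it is not a backreference at all.

```python
import re

# In a pattern, \0 is the NUL character, so r'A\0' matches 'A' + NUL.
nul_match = re.fullmatch(r'A\0', 'A\x00')
print(nul_match is not None)  # True

# It does not behave like a reference to the whole match.
self_ref = re.fullmatch(r'A\0', 'AA')
print(self_ref is None)  # True
```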

I can’t say I agree with this logic, since \g<> is already an arbitrary extension, but it’s an argument that I can see someone making.

Answered By: ymbirtt

Quoting from https://docs.python.org/3/library/re.html

\number

Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.
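
The quoted behaviour can be checked with a short sketch. The first pattern is the one from the docs; the second pattern, (a)[\1], is my own illustration of the character-class rule and is not taken from the docs:

```python
import re

backref = re.compile(r'(.+) \1')

print(bool(backref.fullmatch('the the')))  # True
print(bool(backref.fullmatch('55 55')))    # True
print(bool(backref.fullmatch('thethe')))   # False (no space between the halves)

# Inside a character class, \1 is the character with octal value 1,
# not a reference to group 1.
in_class = re.compile(r'(a)[\1]')
print(bool(in_class.fullmatch('a\x01')))   # True
print(bool(in_class.fullmatch('aa')))      # False
```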

To summarize:

  • Use \1, \2 up to \99 provided no more digits are present after the numbered backreference
  • Use \g<0>, \g<1>, etc. (not limited to 99) to robustly backreference a group
    • as far as I know, \g<0> is useful in the replacement section to refer to the entire matched portion but wouldn’t make sense in the search section
    • if you use the third-party regex module, then (?0) is useful in the search section as well, for example to create recursively matching patterns
Answered By: Sundeep