Error "sre_constants.error: unmatched group" when using Pattern.sub(r'\1'

Question:

I know there have been several questions on this subject already, but none help me resolve my problem.

I have to replace names in a CSV document when they follow the tags {SPEAKER} or {GROUP OF SPEAKERS}.

Code

The erroneous part of my script is:

list_speakers = re.compile(r'^\{GROUP OF SPEAKERS\}\t(.*)|^\{SPEAKER\}\t(.*)')

usernames = set()
for f in corpus:
    with open(f, "r", encoding=encoding) as fin:
        line = fin.readline()
        while line:
            line = line.rstrip()
            if not line:
                line = fin.readline()
                continue

            if not list_speakers.match(line):
                line = fin.readline()
                continue

            names = list_speakers.sub(r'\1', line)
            names = names.split(", ")
            for name in names:
                usernames.add(name)

            line = fin.readline()

Error

However, I receive the following error message:

File "/usr/lib/python2.7/re.py", line 291, in filter
    return sre_parse.expand_template(template, match)
  File "/usr/lib/python2.7/sre_parse.py", line 831, in expand_template
    raise error, "unmatched group"
sre_constants.error: unmatched group

I am using Python 2.7.

How can I fix this?

Asked By: Basile

||

Answers:

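The issue is a known one: in Python versions before 3.5, a backreference to a group that did not participate in the match is not replaced with an empty string; the re module raises the "unmatched group" error instead. Here, when the {SPEAKER} alternative matches, Group 1 is unset, so \1 fails.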

You need to make sure the pattern has only one capturing group, or use a lambda expression as the replacement argument to implement custom replacement logic.
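As a minimal sketch of the lambda option, the original two-group pattern can be kept and the replacement can return whichever group actually took part in the match. The sample line is made up for illustration and is not from the asker's corpus.

```python
import re

# The two-group pattern from the question.
list_speakers = re.compile(r'^\{GROUP OF SPEAKERS\}\t(.*)|^\{SPEAKER\}\t(.*)')

# Only one of the two groups is set for any given match; the other is None.
# Fall back to an empty string so the replacement is always a string.
pick_group = lambda m: m.group(1) or m.group(2) or ''

line = '{SPEAKER}\tAlice, Bob'  # hypothetical sample line
names = list_speakers.sub(pick_group, line).split(', ')
```

Because the replacement is a callable, the template expansion that raises "unmatched group" is never invoked, so this should behave the same on Python 2.7 and on newer versions.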

Here, you can easily rewrite the regex as a pattern with a single capturing group:

r'^\{(?:GROUP OF SPEAKERS|SPEAKER)\}\t(.*)'

See the regex demo

Details

  • ^ – start of string
  • \{ – a {
  • (?:GROUP OF SPEAKERS|SPEAKER) – a non-capturing group matching either GROUP OF SPEAKERS or SPEAKER
  • \} – a } (you may also write }, it does not need escaping)
  • \t – a tab char
  • (.*) – Group 1: any 0+ chars other than line break chars, as many as possible (the rest of the line).
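A short sketch of the single-group pattern in use is given here. The helper name extract_names and the sample lines are assumptions added for illustration; the loop over real files in the question would call the same logic once per line.

```python
import re

# Single capturing group: \1 is always set whenever the pattern matches.
list_speakers = re.compile(r'^\{(?:GROUP OF SPEAKERS|SPEAKER)\}\t(.*)')

def extract_names(line):
    """Return the names that follow a speaker tag, or [] if there is no tag."""
    if not list_speakers.match(line):
        return []
    return list_speakers.sub(r'\1', line).split(', ')

usernames = set()
# Hypothetical sample lines standing in for the CSV content.
for line in ['{SPEAKER}\tAlice', '{GROUP OF SPEAKERS}\tBob, Carol', 'no tag here']:
    usernames.update(extract_names(line))
```

Since the match object is already available, reading match.group(1) directly instead of calling sub would be an equivalent and slightly simpler option.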