error "unmatched group" when using re.sub in Python 2.7

Question

I have a list of strings. Each element represents a field as a key and a value separated by a space:

listA = [
'abcd1-2 4d4e',
'xyz0-1 551',
'foo 3ea',
'bar1 2bd',
'mc-mqisd0-2 77a'
]

Behavior

I need to build a dict from this list, expanding keys like 'abcd1-2' by the range denoted by 1-2 into multiple keys like abcd1 and abcd2, each with the same value (here 4d4e).

It should run as part of an Ansible plugin, where Python 2.7 is used.

Expected

The end result would look like the dict below:

{
abcd1: 4d4e,
abcd2: 4d4e,
xyz0: 551,
xyz1: 551,
foo: 3ea,
bar1: 2bd,
mc-mqisd0: 77a,
mc-mqisd1: 77a,
mc-mqisd2: 77a,
}

Code

I have created the function below. It works with Python 3.

  def listFln(listA):
    import re
    fL = []
    for i in listA:
      aL = i.split()[0]
      bL = i.split()[1]
      comp = re.sub('^(.+?)(\d+-\d+)?$',r'\1',aL)
      cmpCountR = re.sub('^(.+?)(\d+-\d+)?$',r'\2',aL)
      if cmpCountR.strip():
        nStart = int(cmpCountR.split('-')[0])
        nEnd = int(cmpCountR.split('-')[1])
        for j in range(nStart,nEnd+1):
          fL.append(comp + str(j) + ' ' + bL)
      else:
        fL.append(i)

    return(dict([k.split() for k in fL]))

Error

In lower Python versions like Python 2.7, this code throws an "unmatched group" error:

    cmpCountR = re.sub('^(.+?)(\d+-\d+)?$',r'\2',aL)
  File "/usr/lib64/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib64/python2.7/re.py", line 275, in filter
    return sre_parse.expand_template(template, match)
  File "/usr/lib64/python2.7/sre_parse.py", line 800, in expand_template
    raise error, "unmatched group"

Anything wrong with the regex here?

Asked By: Vijesh

Answer 1

I reproduced this with Python 2.7. This answer explains why re.sub fails in Python 2.7 when a backreference points to a group that did not match, and shows a way to fix it.

Both patterns compile

import re

# both seem identical
regex1 = '^(.+?)(\d+-\d+)?$'
regex2 = '^(.+?)(\d+-\d+)?$'

# also the compiled pattern is identical, see the object address
re.compile(regex1)  # <_sre.SRE_Pattern object at 0x7f575ef8fd40>
re.compile(regex2)  # <_sre.SRE_Pattern object at 0x7f575ef8fd40>

Note: Compiling the pattern once with re.compile() saves time when it is reused many times, as in this loop.
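
A minimal sketch of that reuse, assuming the question's pattern with its \d escapes restored. The names KEY_PATTERN and split_key are made up for illustration and do not appear in the question:

```python
import re

# Compiled once at import time and reused for every key.
# Assumption: this is the question's regex with the \d escapes restored.
KEY_PATTERN = re.compile(r'^(.+?)(\d+-\d+)?$')


def split_key(key):
    """Return (prefix, range_or_None) for a key such as 'xyz0-1'."""
    match = KEY_PATTERN.match(key)
    if match is None:
        return None  # e.g. an empty string
    return match.groups()


results = [split_key(k) for k in ['abcd1-2', 'foo', 'mc-mqisd0-2']]
print(results)
```

The second tuple element is None whenever the key carries no range, which is exactly the case that trips up the \2 backreference.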

Fix: test for groups found

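The error message indicates that a group was referenced but did not take part in the match.
Put another way: the replacement template passed to re.sub (see the 2.7 docs) refers to groups such as the second capturing group (\2), which was not found or captured in the given input string. Python 2.7 raises an error in that case, while Python 3.5 and later substitute an empty string for an unmatched group, which is why the same code runs on Python 3: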

sre_constants.error: unmatched group
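
As a small illustration, the sketch below triggers the problem on a single key and shows one possible workaround that is not the approach used in the rest of this answer: passing a function as the replacement, so the unmatched group can be mapped to an empty string by hand. The helper name range_part is hypothetical:

```python
import re

pattern = re.compile(r'^(.+?)(\d+-\d+)?$')


def range_part(key):
    # A callable replacement receives the match object, so an unmatched
    # group (None) can be turned into '' explicitly on any Python version.
    return pattern.sub(lambda m: m.group(2) or '', key)


print(repr(range_part('xyz0-1')))  # '0-1'
print(repr(range_part('foo')))     # ''

# The template form behaves differently per version for 'foo':
# Python 2.7 raises "unmatched group", Python 3.5+ yields ''.
try:
    print(repr(pattern.sub(r'\2', 'foo')))
except re.error as exc:
    print('re.error: %s' % exc)
```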

To fix this, we should check which groups were actually found in the match.
To do so, we use re.match(regex, str) or the compiled variant pattern.match(str) to create a Match object, then Match.groups() to return all groups as a tuple (with None for every group that did not match).

import re

regex = r'^(.+?)(\d+-\d+)?$'  # a key followed by optional digits-range
pattern = re.compile(regex)  # <_sre.SRE_Pattern object at 0x7f575ef8fd40>


def dict_with_expanded_digits(fields_list):
    entry_list = []
    for fields in fields_list:
        (key_digits_range, value) = fields.split()  # a pair of ('key0-1', 'value') 

        # test for match and groups found
        match = pattern.match(key_digits_range)

        # skip to the next iteration here, if not matching the pattern
        # (checked first, because match is None then and has no groups())
        if not match:
            print('ERROR: no valid key! Will not add to dict.', fields)
            continue

        print("DEBUG: groups:", match.groups())  # tuple containing all the subgroups of the match,
        # watch: the 3rd iteration has only group(1), while group(2) is None

        # if no  2nd group, only a single key,value
        if not match.group(2):
            print('WARN: key without range! Will add as single entry:', fields)
            entry_list.append( (key_digits_range, value) )
            continue  # stop iteration here and continue with next
            
        key = pattern.sub(r'\1', key_digits_range)
        index_range = pattern.sub(r'\2', key_digits_range)
        
        # no strip needed here
        (start, end) = index_range.split('-')
        for index in range(int(start), int(end)+1):
            expanded_key = "{}{}".format(key, index)
            entry = (expanded_key, value)  # use tuple for each field entry (key, value)
            entry_list.append(entry)

    return dict(entry_list)


list_a = [
  'abcd1-2 4d4e',  # 2 entries
  'xyz0-1 551',   # 2 entries
  'foo 3ea',  # 1 entry
  'bar1 2bd',   # 1 entry
  'mc-mqisd0-2 77a'  # 3 entries
]

dict_a = dict_with_expanded_digits(list_a)
print("INFO: resulting dict with length: ", len(dict_a), dict_a)

assert len(dict_a) == 9

Prints:

('DEBUG: groups:', ('abcd', '1-2'))
('DEBUG: groups:', ('xyz', '0-1'))
('DEBUG: groups:', ('foo', None))
('WARN: key without range! Will add as single entry:', 'foo 3ea')
('DEBUG: groups:', ('bar1', None))
('WARN: key without range! Will add as single entry:', 'bar1 2bd')
('DEBUG: groups:', ('mc-mqisd', '0-2'))
('INFO: resulting dict with length: ', 9, {'bar1': '2bd', 'foo': '3ea', 'mc-mqisd2': '77a', 'mc-mqisd0': '77a', 'mc-mqisd1': '77a', 'xyz1': '551', 'xyz0': '551', 'abcd1': '4d4e', 'abcd2': '4d4e'})

Note on added improvements

renamed function and variables to express intent
used tuples where possible, e.g. the assignment (start, end)
instead of the re module functions, used the equivalent methods of the compiled pattern
the guard statement if not match.group(2): avoids expanding the field and just adds the key-value pair as is
added an assert to verify that the given list of 5 is expanded to a dict of 9 as expected
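
Since the Match object already holds both groups, the two pattern.sub calls are not strictly needed. The following is a more compact sketch under that assumption, not a replacement for the code above; the names expand_key and dict_from_fields are invented for this example:

```python
import re

pattern = re.compile(r'^(.+?)(\d+-\d+)?$')


def expand_key(key):
    """Return the list of keys that `key` expands to."""
    match = pattern.match(key)
    if match is None or match.group(2) is None:
        return [key]  # no range suffix, nothing to expand
    start, end = match.group(2).split('-')
    return ['{0}{1}'.format(match.group(1), i)
            for i in range(int(start), int(end) + 1)]


def dict_from_fields(fields_list):
    result = {}
    for fields in fields_list:
        key, value = fields.split()
        for expanded in expand_key(key):
            result[expanded] = value
    return result


print(dict_from_fields(['abcd1-2 4d4e', 'foo 3ea', 'mc-mqisd0-2 77a']))
```

Reading the groups from the match avoids backreferences entirely, so the behavior does not depend on how a given Python version expands an unmatched group.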

Answered By: hc_dev

Answer 2

Here’s a simpler version using findall instead of sub, successfully tested on 2.7. It also directly creates the dict instead of first building a list:

mylist=[
'abcd1-2 4d4e',
'xyz0-1 551',
'foo 3ea',
'bar1 2bd',
'mc-mqisd0-2 77a'
]

def listFln(listA):
    import re
    fL = {}
    for i in listA:
        aL = i.split()[0]
        bL = i.split()[1]
        comp = re.findall(r'^(.+?)(\d+-\d+)?$',aL)[0]
        if comp[1]:
            nStart = int(comp[1].split('-')[0])
            nEnd = int(comp[1].split('-')[1])
            for j in range(nStart,nEnd+1):
                fL[comp[0]+str(j)] = bL
        else:
            fL[comp[0]] = bL
    return fL
    
print(listFln(mylist))
# {'abcd1': '4d4e',
#  'abcd2': '4d4e',
#  'xyz0': '551',
#  'xyz1': '551',
#  'foo': '3ea',
#  'bar1': '2bd',
#  'mc-mqisd0': '77a',
#  'mc-mqisd1': '77a',
#  'mc-mqisd2': '77a'}

Answered By: Swifty

Answer 3

You could use a single pattern with 4 capture groups, and check if the 3rd capture group value is not empty.

^(\S*?)(?:(\d+)-(\d+))?\s+(.*)

The pattern matches:

^ Start of string
(\S*?) Capture group 1, match optional non-whitespace chars, as few as possible
(?:(\d+)-(\d+))? Optionally capture 1+ digits in group 2 and group 3 with a - in between
\s+ Match 1+ whitespace chars
(.*) Capture group 4, match the rest of the line
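
To make the group layout concrete, here is a short sketch that applies the pattern to two of the sample lines. It assumes the pattern above with its backslash escapes intact; the variable name line_pattern is only illustrative:

```python
import re

# Assumption: the answer's pattern with its backslash escapes intact.
line_pattern = re.compile(r'^(\S*?)(?:(\d+)-(\d+))?\s+(.*)')

for line in ['abcd1-2 4d4e', 'foo 3ea']:
    print(line_pattern.match(line).groups())
# ('abcd', '1', '2', '4d4e')
# ('foo', None, None, '3ea')

# findall reports unmatched groups as empty strings instead of None,
# which is why a check for a non-empty 3rd group is enough.
print(line_pattern.findall('foo 3ea'))
# [('foo', '', '', '3ea')]
```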

Code example (works on Python 2 and Python 3)

import re

strings = [
    'abcd1-2 4d4e',
    'xyz0-1 551',
    'foo 3ea',
    'bar1 2bd',
    'mc-mqisd0-2 77a'
]

def listFln(listA):
    dct = {}
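    for s in listA:
        # findall yields one 4-tuple per match; sum(..., ()) flattens it into a single tuple
        lst = sum(re.findall(r"^(\S*?)(?:(\d+)-(\d+))?\s+(.*)", s), ())
        if not lst:
            continue  # no "key value" shape, skip instead of raising IndexError
        if lst[2]:
            for i in range(int(lst[1]), int(lst[2]) + 1):
                dct[lst[0] + str(i)] = lst[3]
        else:
            dct[lst[0]] = lst[3]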
    return dct


print(listFln(strings))

Output

{
    'abcd1': '4d4e',
    'abcd2': '4d4e',
    'xyz0': '551',
    'xyz1': '551',
    'foo': '3ea',
    'bar1': '2bd',
    'mc-mqisd0': '77a',
    'mc-mqisd1': '77a',
    'mc-mqisd2': '77a'
}

Answered By: The fourth bird