error "unmatched group" when using re.sub in Python 2.7
Question:
I have a list of strings. Each element represents a field as key value separated by space:
listA = [
'abcd1-2 4d4e',
'xyz0-1 551',
'foo 3ea',
'bar1 2bd',
'mc-mqisd0-2 77a'
]
Behavior
I need to return a dict built from this list, expanding keys like 'abcd1-2' by the range denoted by 1-2 into multiple keys like abcd1 and abcd2, each with the same value, 4d4e.
It should run as part of an Ansible plugin, where Python 2.7 is used.
Expected
The end result would look like the dict below:
{
abcd1: 4d4e,
abcd2: 4d4e,
xyz0: 551,
xyz1: 551,
foo: 3ea,
bar1: 2bd,
mc-mqisd0: 77a,
mc-mqisd1: 77a,
mc-mqisd2: 77a,
}
Code
I have created the function below. It works with Python 3.
def listFln(listA):
    import re
    fL = []
    for i in listA:
        aL = i.split()[0]
        bL = i.split()[1]
        comp = re.sub('^(.+?)(\d+-\d+)?$',r'\1',aL)
        cmpCountR = re.sub('^(.+?)(\d+-\d+)?$',r'\2',aL)
        if cmpCountR.strip():
            nStart = int(cmpCountR.split('-')[0])
            nEnd = int(cmpCountR.split('-')[1])
            for j in range(nStart,nEnd+1):
                fL.append(comp + str(j) + ' ' + bL)
        else:
            fL.append(i)
    return(dict([k.split() for k in fL]))
Error
In older Python versions such as Python 2.7, this code raises an "unmatched group" error:
    cmpCountR = re.sub('^(.+?)(\d+-\d+)?$',r'\2',aL)
  File "/usr/lib64/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib64/python2.7/re.py", line 275, in filter
    return sre_parse.expand_template(template, match)
  File "/usr/lib64/python2.7/sre_parse.py", line 800, in expand_template
    raise error, "unmatched group"
Anything wrong with the regex here?
Answers:
I used Python 2.7 to reproduce this. This answer explains the issue with backreferences to unmatched groups in re.sub on Python 2.7 and shows some patterns to fix it.
Both patterns compile
import re
# both seem identical
regex1 = '^(.+?)(\d+-\d+)?$'
regex2 = '^(.+?)(\\d+-\\d+)?$'
# the compiled pattern is identical too, see the object address
re.compile(regex1)  # <_sre.SRE_Pattern object at 0x7f575ef8fd40>
re.compile(regex2)  # <_sre.SRE_Pattern object at 0x7f575ef8fd40>
Note: Compiling the pattern once with re.compile() saves time when it is reused many times, as in this loop.
Fix: test for groups found
The error message indicates that there are groups that were not matched.
Put differently: the replacement template passed to re.sub (see the Python 2.7 docs) references groups, such as the second capturing group (\2), that were not found or captured in the given input string. For a key without a range, such as 'foo', the optional second group takes no part in the match, and Python 2.7 raises:
sre_constants.error: unmatched group
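The failure can be reproduced in isolation. The sketch below applies the question's pattern to a key without a range; the helper name range_part is made up for illustration. As far as I know, Python 3.5 and later replace an unmatched group with an empty string instead of raising, which would explain why the original code runs on Python 3.

```python
import re

PATTERN = re.compile(r'^(.+?)(\d+-\d+)?$')

def range_part(key):
    """Return the range suffix of key via sub, or None if sub fails (hypothetical helper)."""
    try:
        # Python 2.7 raises here when group 2 did not take part in the match
        return PATTERN.sub(r'\2', key)
    except re.error:
        return None

match = PATTERN.match('foo')
print(match.groups())        # ('foo', None): group 2 is unmatched
print(range_part('foo'))     # None on Python 2.7, '' on Python 3.5+
print(range_part('xyz0-1'))  # '0-1' on both
```

Checking match.groups() works the same way on both versions, which is why the fix relies on the match object rather than on the replacement template.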
To fix this, we should test which groups were found in the match.
Therefore we use re.match(regex, str) or the compiled variant pattern.match(str) to create a Match object, then Match.groups() to return all found groups as a tuple.
import re

regex = r'^(.+?)(\d+-\d+)?$'  # a key followed by an optional digits range
pattern = re.compile(regex)  # <_sre.SRE_Pattern object at 0x7f575ef8fd40>

def dict_with_expanded_digits(fields_list):
    entry_list = []
    for fields in fields_list:
        (key_digits_range, value) = fields.split()  # a pair like ('key0-1', 'value')

        # test for match and groups found
        match = pattern.match(key_digits_range)
        # skip to the next iteration if the key does not match the pattern
        if not match:
            print('ERROR: no valid key! Will not add to dict.', fields)
            continue
        # tuple containing all the subgroups of the match;
        # watch: the 3rd iteration has only group(1), while group(2) is None
        print("DEBUG: groups:", match.groups())

        # if no 2nd group, only a single key,value
        if not match.group(2):
            print('WARN: key without range! Will add as single entry:', fields)
            entry_list.append((key_digits_range, value))
            continue  # stop iteration here and continue with next

        key = pattern.sub(r'\1', key_digits_range)
        index_range = pattern.sub(r'\2', key_digits_range)

        # no strip needed here
        (start, end) = index_range.split('-')
        for index in range(int(start), int(end) + 1):
            expanded_key = "{}{}".format(key, index)
            entry = (expanded_key, value)  # use tuple for each field entry (key, value)
            entry_list.append(entry)
    return dict(entry_list)

list_a = [
    'abcd1-2 4d4e',  # 2 entries
    'xyz0-1 551',  # 2 entries
    'foo 3ea',  # 1 entry
    'bar1 2bd',  # 1 entry
    'mc-mqisd0-2 77a'  # 3 entries
]
dict_a = dict_with_expanded_digits(list_a)
print("INFO: resulting dict with length: ", len(dict_a), dict_a)
assert len(dict_a) == 9
Prints:
('DEBUG: groups:', ('abcd', '1-2'))
('DEBUG: groups:', ('xyz', '0-1'))
('DEBUG: groups:', ('foo', None))
('WARN: key without range! Will add as single entry:', 'foo 3ea')
('DEBUG: groups:', ('bar1', None))
('WARN: key without range! Will add as single entry:', 'bar1 2bd')
('DEBUG: groups:', ('mc-mqisd', '0-2'))
('INFO: resulting dict with length: ', 9, {'bar1': '2bd', 'foo': '3ea', 'mc-mqisd2': '77a', 'mc-mqisd0': '77a', 'mc-mqisd1': '77a', 'xyz1': '551', 'xyz0': '551', 'abcd1': '4d4e', 'abcd2': '4d4e'})
Note on added improvements
- renamed function and variables to express intent
- used tuples where possible, e.g. the assignment (start, end)
- instead of the re. functions, used the equivalent methods of the compiled pattern
- the guard statement if not match.group(2): avoids expanding the field and just adds the key-value pair as is
- added an assert to verify that the given list of 5 is expanded to a dict of 9 as expected
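Since the match object already holds both groups, the two pattern.sub calls could be dropped altogether. The following is a sketch of that variation, not the code tested above; the function name expand_fields is made up here, and it assumes every entry consists of exactly one key and one value separated by whitespace.

```python
import re

PATTERN = re.compile(r'^(.+?)(\d+-\d+)?$')  # key, optionally followed by a digits range

def expand_fields(fields_list):
    """Build a dict, expanding keys such as 'xyz0-1' into 'xyz0' and 'xyz1'."""
    result = {}
    for fields in fields_list:
        key_digits_range, value = fields.split()
        match = PATTERN.match(key_digits_range)
        if not match:
            continue  # skip keys that do not match the pattern
        key, index_range = match.groups()
        if index_range is None:
            # no range suffix: keep the key as it is
            result[key_digits_range] = value
            continue
        start, end = index_range.split('-')
        for index in range(int(start), int(end) + 1):
            result['{0}{1}'.format(key, index)] = value
    return result

print(expand_fields(['abcd1-2 4d4e', 'foo 3ea', 'bar1 2bd']))
```

Reading the groups from the match never touches the replacement template, so this should behave the same on Python 2.7 and Python 3.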
Here’s a simpler version using findall instead of sub, successfully tested on 2.7. It also creates the dict directly instead of first building a list:
mylist = [
    'abcd1-2 4d4e',
    'xyz0-1 551',
    'foo 3ea',
    'bar1 2bd',
    'mc-mqisd0-2 77a'
]

def listFln(listA):
    import re
    fL = {}
    for i in listA:
        aL = i.split()[0]
        bL = i.split()[1]
        # findall returns one (key, range) tuple; an unmatched group is ''
        comp = re.findall(r'^(.+?)(\d+-\d+)?$', aL)[0]
        if comp[1]:
            nStart = int(comp[1].split('-')[0])
            nEnd = int(comp[1].split('-')[1])
            for j in range(nStart, nEnd+1):
                fL[comp[0]+str(j)] = bL
        else:
            fL[comp[0]] = bL
    return fL

print(listFln(mylist))
# {'abcd1': '4d4e',
# 'abcd2': '4d4e',
# 'xyz0': '551',
# 'xyz1': '551',
# 'foo': '3ea',
# 'bar1': '2bd',
# 'mc-mqisd0': '77a',
# 'mc-mqisd1': '77a',
# 'mc-mqisd2': '77a'}
You could use a single pattern with 4 capture groups, and check if the 3rd capture group value is not empty.
^(\S*?)(?:(\d+)-(\d+))?\s+(.*)
The pattern matches:
- ^ Start of string
- (\S*?) Capture group 1, match optional non-whitespace chars, as few as possible
- (?:(\d+)-(\d+))? Optionally capture 1+ digits in group 2 and group 3 with a - in between
- \s+ Match 1+ whitespace chars
- (.*) Capture group 4, match the rest of the line
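To make the group layout concrete, this small sketch prints what each group captures for one key with a range and one without. It uses re.match rather than findall, so unmatched groups show up as None; the variable names are illustrative only.

```python
import re

pattern = re.compile(r"^(\S*?)(?:(\d+)-(\d+))?\s+(.*)")

for line in ['abcd1-2 4d4e', 'foo 3ea']:
    m = pattern.match(line)
    print(m.groups())
# ('abcd', '1', '2', '4d4e')
# ('foo', None, None, '3ea')
```

With findall the unmatched groups come back as empty strings instead of None, so a plain truthiness check on group 3 is enough to tell the two cases apart.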
Code example (works on Python 2 and Python 3)
import re

strings = [
    'abcd1-2 4d4e',
    'xyz0-1 551',
    'foo 3ea',
    'bar1 2bd',
    'mc-mqisd0-2 77a'
]

def listFln(listA):
    dct = {}
    for s in listA:
        # findall returns a list with one 4-tuple; sum(..., ()) flattens it
        lst = sum(re.findall(r"^(\S*?)(?:(\d+)-(\d+))?\s+(.*)", s), ())
        if not lst:
            continue  # skip lines that do not match the pattern
        if lst[2]:
            for i in range(int(lst[1]), int(lst[2]) + 1):
                dct[lst[0] + str(i)] = lst[3]
        else:
            dct[lst[0]] = lst[3]
    return dct

print(listFln(strings))
Output
{
'abcd1': '4d4e',
'abcd2': '4d4e',
'xyz0': '551',
'xyz1': '551',
'foo': '3ea',
'bar1': '2bd',
'mc-mqisd0': '77a',
'mc-mqisd1': '77a',
'mc-mqisd2': '77a'
}