Python re find start and end index of group match

Question:

Python's re match objects have .start() and .end() methods, which give the indices of the whole match.
I want to find the start and end index of a group within the match. How can I do this?
Example:

>>> import re
>>> REGEX = re.compile(r'h(?P<num>[0-9]{3})p')
>>> test = "hello h889p something"
>>> match = REGEX.search(test)
>>> match.group('num')
'889'
>>> match.start()
6
>>> match.end()
11
>>> match.group('num').start()                  # just trying this. Didn't work
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'start'
>>> REGEX.groupindex
mappingproxy({'num': 1})                        # this is the index of the group in the regex, not the index of the group match, so not what I'm looking for.

The expected output for the example above is (7, 10).

Asked By: Neil

||

Answers:

You could just use string indexing and the index() method:

>>> import re
>>> REGEX = re.compile(r'h(?P<num>[0-9]{3})p')
>>> test = "hello h889p something"
>>> match = REGEX.search(test)
>>> test.index(match.group('num')[0])
7
>>> test.index(match.group('num')[-1])
9

If you want the results as a tuple:

>>> str_match = match.group("num")
>>> results = (test.index(str_match[0]), test.index(str_match[-1]))
>>> results
(7, 9)

Note: As Tom pointed out, you may want to consider using results = (test.index(str_match), test.index(str_match) + len(str_match)) to prevent bugs that arise when the string contains identical characters. For example, if the number were 899, then results would be (7, 8), since the first instance of 9 is at index 8.

Answered By: Jacob Lee

A slight modification on the existing answer is to use index to find the whole group, rather than the starting and ending characters of the group:

import re
REGEX = re.compile(r'h(?P<num>[0-9]{3})p')
test = "hello h889p something"
match = REGEX.search(test)
group = match.group('num')

# modification here to find the start point
idx = test.index(group)

# find the end point using len of group
output = (idx, idx + len(group))  # (7, 10)

This checks for the whole string "889" when determining the index, so there is a little less potential for error than when checking for the first 8 and the first 9. It is still not perfect, though: for example, it gives the wrong result if "889" appears earlier in the string, not surrounded by "h" and "p".
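
To illustrate that caveat, here is a minimal sketch with a modified test string (my own example, not from the question) in which "889" also appears before the real match. The names naive_span and real_span are just illustrative; match.span('num') is used only to show where the group actually sits.

```python
import re

REGEX = re.compile(r'h(?P<num>[0-9]{3})p')
# Hypothetical input: "889" occurs once on its own and once inside h...p
test = "889 hello h889p something"
match = REGEX.search(test)
group = match.group('num')

# str.index finds the first "889", which is not the one the regex matched
idx = test.index(group)
naive_span = (idx, idx + len(group))

# The match object knows where the group really is
real_span = match.span('num')

print(naive_span)  # (0, 3)
print(real_span)   # (11, 14)
```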

Answered By: Tom

A workaround for the given example could be using lookarounds:

import re
REGEX = re.compile(r'(?<=h)[0-9]{3}(?=p)')
test = "hello h889p something"
match = REGEX.search(test)
print(match)

Output

<re.Match object; span=(7, 10), match='889'>
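
If you want the indices as a tuple rather than reading them off the printed match object, match.span() should give them directly. A minimal sketch, assuming the same pattern and test string as above. Keep in mind that Python's re module only supports fixed-width lookbehinds, so this workaround fits cases where the prefix has a known length.

```python
import re

# The lookarounds assert "h" before and "p" after without consuming them,
# so the whole match is just the three digits
REGEX = re.compile(r'(?<=h)[0-9]{3}(?=p)')
test = "hello h889p something"
match = REGEX.search(test)

start, end = match.span()
print((start, end))     # (7, 10)
print(test[start:end])  # 889
```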
Answered By: The fourth bird

You can provide Match.start (and Match.end) with a group name to get the start (end) position of a group:

>>> import re
>>> REGEX = re.compile(r'h(?P<num>[0-9]{3})p')
>>> test = "hello h889p something"
>>> match = REGEX.search(test)
>>> match.start('num')
7
>>> match.end('num')
10

An advantage of this approach over using str.index as suggested in other answers is that you do not run into problems if the group string occurs multiple times.
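
As a small sketch of that advantage, Match.span also accepts a group name and returns both indices at once. The test string and the OPTIONAL pattern below are my own illustrative examples, not from the question. The second part shows that a group which exists but did not take part in the match reports -1 for both positions.

```python
import re

REGEX = re.compile(r'h(?P<num>[0-9]{3})p')
# Hypothetical input: a stray "889" first, then two real matches
test = "889 h889p and h123p"

# span('num') is (start('num'), end('num')) for each match
spans = [m.span('num') for m in REGEX.finditer(test)]
print(spans)  # [(5, 8), (15, 18)]

# An optional group that does not participate in the match
OPTIONAL = re.compile(r'a(?P<x>b)?c')
unmatched = OPTIONAL.search("ac").span('x')
print(unmatched)  # (-1, -1)
```

So if a group is optional, it is worth checking for -1 before slicing with the returned indices.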

Answered By: jfschaefer