Get start location of capturing group within regex pattern

Question

Basically, I want to find the index for the first occurrence of any of the substrings: “ABC”, “DEF”, or “GHI”, so long as they occur in an interval of three. The regex that I wrote to match this pattern is:

regex = compile ("(?:[a-zA-Z]{3})*?(ABC|DEF|GHI)")

The *? ensures that I get the first match, since it’s non-greedy. I’m using a capturing group since I assume that is the only way to get the index of the substring that I’m actually looking for. I don’t care where the match itself starts, just where the capturing group starts. The ...{3}... mandates that the pattern occur in an interval of 3, i.e.:

example_1 = "BNDABCDJML"

example_2 = "JKMJABCKME"

example_1 would match since "ABC" occurs at position 3 but example_2 would not match since "ABC" occurs at position 4.

Ideally, given the string:

text = "STCABCFFC"

this matches, but if I simply get the start of the match, it gives me 0, since that’s the beginning index of the whole match, whereas what I want is 3.

I’d like to do this:

print match(regex, text).group(1).start()

but, of course, this doesn’t work, since start() is not a method for strings, plus the string is now independent of text. I can’t simply search for the starting index of the substring in the capturing group, because that won’t guarantee that it follows the regex pattern (only occurring in intervals of 3). Perhaps I’m overlooking something; I don’t write much Python, so forgive me if this is a trivial question.

Asked By: Steve P.


Answer 1

You can get the start and end index of a group from the match object, using re.MatchObject.start(group) and re.MatchObject.end(group) (the class is named re.Match in Python 3):

import re

regex = re.compile("(?:[a-zA-Z]{3})*?(ABC|DEF|GHI)")

for m in re.finditer(regex, "STCABCFFC"):
    print(m.start(1), m.end(1))  # start and end index of group 1
    print(m.span(1))  # Prints 2-element tuple `(start, end)`

Answered By: Rohit Jain

Answer 2

You were on the right track. start() is a method of the MatchObject, not of the string returned by group(). Here’s the example they give in the docs:

>>> email = "tony@tiremove_thisger.net"
>>> m = re.search("remove_this", email)
>>> email[:m.start()] + email[m.end():]
'tony@tiger.net'

Basically, instead of match(regex, text).group(1).start() you should do match(regex, text).start(1).
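As a minimal sketch of that suggestion applied to the strings from the question: the helper name group_start is hypothetical and only there for illustration, and the sketch assumes the pattern should be anchored at index 0, which is what re.match does.

```python
import re

# Pattern from the question: lazily skip whole 3-letter blocks, then capture the target.
REGEX = re.compile(r"(?:[a-zA-Z]{3})*?(ABC|DEF|GHI)")


def group_start(text):
    """Return the start index of capturing group 1, or None if there is no match.

    `group_start` is a hypothetical helper name used only for this illustration.
    """
    # re.match anchors at index 0, so every skipped block begins at a multiple of 3.
    m = REGEX.match(text)
    return m.start(1) if m else None


print(group_start("STCABCFFC"))   # 3 (text from the question)
print(group_start("BNDABCDJML"))  # 3 (example_1)
print(group_start("JKMJABCKME"))  # None (example_2: "ABC" sits at index 4)
```

One thing to watch: re.search and re.finditer are not anchored at index 0, so on example_2 they can begin the match at index 1 and report "ABC" at index 4. If the interval-of-3 rule has to hold relative to the start of the string, re.match (or a leading ^ in the pattern) is probably the safer choice.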

Answered By: DaoWen

Answer 3

Using a numeric group index such as start(1) can be error-prone; a named group is more intuitive (code adapted from Rohit Jain’s answer):

import re

regex = re.compile("(?:[a-zA-Z]{3})*?(?P<my_group>ABC|DEF|GHI)")

for m in re.finditer(regex, "STCABCFFC"):
    print(m.start('my_group'), m.end('my_group'))
    print(m.span('my_group'))  # Prints 2-element tuple `(start, end)`

# outputs: 
# 3 6
# (3, 6)
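To illustrate why a numeric index can be fragile, here is a small sketch with a hypothetical variation of the pattern in which the skipped prefix is also captured (the prefix group is an assumption added for illustration, not part of the original question). The numeric index of the wanted group shifts, while the name keeps working:

```python
import re

# Hypothetical variation: the skipped prefix is now captured too, so it becomes group 1.
regex = re.compile(r"(?P<prefix>(?:[a-zA-Z]{3})*?)(?P<my_group>ABC|DEF|GHI)")

m = regex.match("STCABCFFC")
print(m.start(1))           # 0: group 1 is now the prefix, not the target
print(m.start("my_group"))  # 3: the name still points at the intended group
```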

Answered By: Lerner Zhang
