See which component in regex alternation was captured

Question:

In regex alternation, is there a way to retrieve which alternation was matched? I just need the first alternation match, not all the alternations that match.

For example, I have a regex like this

pattern = r'(abc.*def|mno.*pqr|mno.*pqrt|.....)'
string = 'mnoxxxpqrt'

I want the output to be 'mno.*pqr'

How should I write the regex statement? Python language is preferred.

Asked By: Bao Le

Answers:

Well you could iterate the terms in the regex alternation:

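import re

string = 'abcxxxdef'
pattern = r'(abc.*def|mno.*pqr)'
# Strip the outer parentheses and split into the individual alternatives.
terms = pattern[1:-1].split("|")
for term in terms:
    if re.search(term, string):
        print("MATCH => " + term)
        break  # only the first matching alternative is needed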

This prints:

MATCH => abc.*def
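
One caveat, shown here as a hedged sketch rather than a definitive statement about every input: searching each term separately is not always the same as asking which alternative the combined pattern used, because the combined pattern takes the leftmost match position first. The helper names and the sample text below are made up for illustration.

```python
import re

def first_term_by_iteration(terms, text):
    # Search each alternative on its own, in order, and return the first hit.
    for term in terms:
        if re.search(term, text):
            return term
    return None

def term_used_by_alternation(terms, text):
    # Wrap each alternative in its own group and ask the combined match
    # which group took part (assumes the terms contain no groups themselves).
    combined = '|'.join(f'({t})' for t in terms)
    m = re.search(combined, text)
    return terms[m.lastindex - 1] if m else None

terms = [r'abc.*def', r'mno.*pqr']
text = 'mno_pqr abc_def'  # hypothetical input where both terms occur

print(first_term_by_iteration(terms, text))   # abc.*def
print(term_used_by_alternation(terms, text))  # mno.*pqr
```

For the strings in the question both approaches agree; they differ only when a later alternative matches earlier in the text.
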
Answered By: Tim Biegeleisen

You can use capture groups:

import re
string = 'mnoxxxpqrt'
patterns = ['abc.*def', 'mno.*pqr']
match = re.match(r'((abc.*def)|(mno.*pqr))', string)
# groups() returns groups 1..n; index 0 is the outer group, so start at 1.
groups = match.groups()
for i in range(1, len(groups)):
    if groups[i] is not None:
        pattern = patterns[i - 1]
        break
print(pattern)

Result: mno.*pqr

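Expressions inside parentheses are capture groups. match.group(0) is the whole match, and match.groups() returns groups 1 through the last one as a tuple; here its first element is the outer group, and the remaining elements correspond to the individual alternatives.

Then you need to find the index of the alternative whose group is not None. The drawback is that your patterns need to be defined beforehand.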
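
To make the group numbering concrete, here is a small sketch that prints what the match object holds for the string from the question. The helper name alternative_index is an assumption introduced for this example only.

```python
import re

def alternative_index(match):
    # groups() holds groups 1..n; element 0 is the outer group, so skip it
    # and return the 0-based position of the first alternative that is set.
    inner = match.groups()[1:]
    return next(i for i, g in enumerate(inner) if g is not None)

patterns = ['abc.*def', 'mno.*pqr']
m = re.match(r'((abc.*def)|(mno.*pqr))', 'mnoxxxpqrt')

print(m.group(0))                      # mnoxxxpqr
print(m.groups())                      # ('mnoxxxpqr', None, 'mnoxxxpqr')
print(patterns[alternative_index(m)])  # mno.*pqr
```
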

Answered By: BrendanOtherwhyz

To do this efficiently without any iterations, you can put your desired sub-patterns in a list and join them into one alternation pattern with each sub-pattern enclosed in a capture group (so the resulting pattern looks like (abc.*def)|(mno.*pqr) instead of (abc.*def|mno.*pqr)). You can then obtain the group number of the sub-pattern with the Match object’s lastindex attribute and in turn obtain the matching sub-pattern from the original list of sub-patterns:

import re

patterns = [r'abc.*def', r'mno.*pqr', r'mno.*pqrt']
pattern = '|'.join(map('({})'.format, patterns))
string = 'mno_foobar_pqrt'
print(pattern)
print(patterns[re.search(pattern, string).lastindex - 1])

This outputs:

(abc.*def)|(mno.*pqr)|(mno.*pqrt)
mno.*pqr

Demo: https://replit.com/@blhsing/JointBruisedMention
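
A possible variation on the same idea, sketched under assumptions: give each sub-pattern a named group and read the Match object's lastgroup attribute instead of doing index arithmetic. The group names and the helper function below are invented for illustration. Because lastgroup refers to the group that closed last, this should also cope with sub-patterns that contain groups of their own.

```python
import re

def first_matching_alternative(named_patterns, text):
    # Build (?P<name>sub)|(?P<name>sub)|... from the mapping.
    combined = '|'.join(
        f'(?P<{name}>{sub})' for name, sub in named_patterns.items()
    )
    m = re.search(combined, text)
    if m is None:
        return None
    # lastgroup is the name of the last matched capturing group.
    return named_patterns[m.lastgroup]

# Hypothetical labels; any valid Python identifiers would do.
named_patterns = {
    'abc_def': r'abc.*def',
    'mno_pqr': r'mno.*pqr',
    'mno_pqrt': r'mno.*pqrt',
}

print(first_matching_alternative(named_patterns, 'mno_foobar_pqrt'))  # mno.*pqr
```
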

Answered By: blhsing

The right answer to the question "How should I write the regex statement?" should actually be:

There is no known way to write the regex statement, using the provided pattern, that lets you extract from the search result which of the alternatives triggered the match.

Since it cannot be done with the given pattern, the pattern has to be changed so that the requested information can be extracted from the match.

One way around this limitation of the regex engine is shown below, but it requires an additional regex search and can fail for some special alternatives.

The code below keeps the simpler pattern without groups and works the other way around: it checks which of the alternatives matches the text found by the entire regex:

import re
pattern = r'abc.*def|mno.*pqr|mno.*pqrt'
text    = 'mnoxxxpqrt'
match   = re.match(pattern,text)[0]
print(next(p for p in pattern.split('|') if re.match(p, match)))

It can fail when the matched text, taken on its own, is no longer a match for the alternative that produced it. This can happen, for example, if the pattern uses a non-word-boundary assertion \B (as mentioned in the comments by Kelly Bundy).
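
A minimal sketch of such a failure, using a made-up pattern and text chosen only to trigger the problem: the first alternative needs the character after the match to satisfy \B, and that context is lost once the matched text is checked in isolation.

```python
import re

def reported_by_recheck(pattern, text):
    # The re-check technique: match, then test each alternative
    # against the matched text alone.
    matched_text = re.match(pattern, text)[0]
    return next(p for p in pattern.split('|') if re.match(p, matched_text))

def actual_alternative(pattern, text):
    # Reference result: one capture group per alternative plus lastindex.
    alts = pattern.split('|')
    m = re.match('|'.join(f'({a})' for a in alts), text)
    return alts[m.lastindex - 1]

pattern = r'abc\B|abc'  # hypothetical pattern, not from the question
text = 'abcd'

print(reported_by_recheck(pattern, text))  # abc
print(actual_alternative(pattern, text))   # abc\B
```

In 'abcd' the first alternative matches because 'c' is followed by 'd', but the extracted text 'abc' ends at a word boundary, so the re-check skips it and reports the second alternative.
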


A more reliable alternative is to perform the search with a modified regex pattern. The approach below uses a dictionary to define the alternatives and a function that returns the matched group:

import re

# Each key is the index of the corresponding capture group in the combined pattern.
dct_alts = {1: r'(abc.*def)', 2: r'(mno.*pqr)', 3: r'(mno.*pqrt)'}

text = 'mnoxxxpqrt'

def get_matched_group(dct_alts, text):
    pattern = '|'.join(dct_alts.values())
    re_match = re.match(pattern, text)
    # lastindex is the number of the group that took part in the match.
    return dct_alts[re_match.lastindex]

print(get_matched_group(dct_alts, text))

prints

(mno.*pqr)

For the sake of completeness a function returning a list of all of the alternatives which give a match (not only the first one which matches):

import re
lst_alts = [r'abc.*def', r'mno.*pqr', r'mno.*pqrt']
text     = 'mnoxxxpqrt'
def get_all_matched_groups(lst_alts, text):
    matches = []
    for pattern in lst_alts: 
        re_match = re.match(pattern, text) 
        if re_match:
            matches.append(pattern)
    return matches
print(get_all_matched_groups(lst_alts, text))

prints

['mno.*pqr', 'mno.*pqrt']
Answered By: Claudio