Python regex matching whatever matched in previous group (1 out of many)

Question:

I have the regular expression (?:AA|BB)(.*)(?:AA|BB) which captures everything between the delimiters AA and BB.

The problem I encounter is that this will also match AA...BB. This is something that I don’t want. How can I make it so that the regular expression only matches AA...AA and BB...BB?

Asked By: AlanSTACK

Answers:

If the strings you need to match start and end with the same leading and trailing delimiters, you just need to capture the leading delimiter and use a backreference inside the pattern itself:

(AA|BB)(.*)\1
^     ^    ^^

In Python, use re.finditer if you want only the group you need; re.findall returns a list of tuples, each of which also contains the AA or BB delimiter. To match from AA up to the next AA rather than the last one, use the lazy quantifier *?: (AA|BB)(.*?)\1

A short Python demo:

import re
p = re.compile(r'(AA|BB)(.*)\1')
test_str = "AA text AA"
print([x.group(2).strip() for x in p.finditer(test_str)])
# => ['text']
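
To make the difference between re.findall and re.finditer concrete, here is a minimal sketch using the lazy variant of the backreference pattern. The sample text and variable names are made up for illustration; note that the trailing AA three BB is not matched because its delimiters differ.

```python
import re

# \1 must repeat whatever group 1 matched, so AA pairs only with AA
# and BB only with BB.
pattern = re.compile(r'(AA|BB)(.*?)\1')
text = "AA one AA BB two BB AA three BB"

# findall returns one tuple per match, and the delimiter is part of it.
pairs = pattern.findall(text)

# finditer yields match objects, so a single group can be picked out.
inner = [m.group(2).strip() for m in pattern.finditer(text)]

print(pairs)   # [('AA', ' one '), ('BB', ' two ')]
print(inner)   # ['one', 'two']
```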

If you need to match strings with mismatching leading and trailing delimiters (say, AA...ZZ and BB...YY), you will have to use alternation:

AA(.*)ZZ|BB(.*)YY

Or, to match up to the closest trailing ZZ and YY, a lazy quantifier version:

AA(.*?)ZZ|BB(.*?)YY

Note that with re.findall this will output empty elements in the results, since only one of the two groups participates in each match. Use this pattern with caution in re.sub: before Python 3.5, a replacement string that references a group that did not participate in the match (its value is None, not an empty string) raises an exception.
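
One way to avoid depending on how unmatched groups are substituted is to pass a function as the replacement. The following is a minimal sketch, assuming the goal is to replace each delimited span with its trimmed contents in square brackets; keep_inner and the sample text are made up for illustration.

```python
import re

pattern = re.compile(r'AA(.*?)ZZ|BB(.*?)YY')
text = "AA first ZZ and BB second YY"

# findall fills the group that did not participate with an empty string.
found = pattern.findall(text)

def keep_inner(match):
    # Exactly one of the two groups participated; the other is None.
    inner = match.group(1) if match.group(1) is not None else match.group(2)
    return "[" + inner.strip() + "]"

result = pattern.sub(keep_inner, text)

print(found)   # [(' first ', ''), ('', ' second ')]
print(result)  # [first] and [second]
```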

Here is an extraction sample code with re.finditer:

import re
p = re.compile(r'(AA)(.*?)(ZZ)|(BB)(.*?)(YY)')
test_str = "AA Text 1 here ZZ and BB Text2 there YY"
print("Contents:") 
print([x.group(2).strip() for x in p.finditer(test_str) if x.group(2)])
print([x.group(5).strip() for x in p.finditer(test_str) if x.group(5)])
print("Delimiters:")
print([(x.group(1), x.group(3)) for x in p.finditer(test_str) if x.group(1) and x.group(3)])
print([(x.group(4), x.group(6)) for x in p.finditer(test_str) if x.group(4) and x.group(6)])

Results:

Contents:
['Text 1 here']
['Text2 there']
Delimiters:
[('AA', 'ZZ')]
[('BB', 'YY')]

In real life, with very long and complex texts, these regexps can be unrolled to make matching linear and efficient, but this is a different story.
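
As a rough illustration of what unrolling can look like, here is a sketch for the single pair AA...ZZ. It is not a benchmark. The unrolled pattern is one possible rewrite, and it differs from the lazy one in that [^Z] also matches newlines while . does not by default. The sample text is made up.

```python
import re

lazy = re.compile(r'AA(.*?)ZZ')

# Unrolled form: runs of non-Z characters, interleaved with any
# single Z that is not followed by another Z.
unrolled = re.compile(r'AA([^Z]*(?:Z(?!Z)[^Z]*)*)ZZ')

text = "AA size Z1 and Z2 ZZ tail AA x ZZ"
lazy_hits = lazy.findall(text)
unrolled_hits = unrolled.findall(text)

print(lazy_hits)      # [' size Z1 and Z2 ', ' x ']
print(unrolled_hits)  # same captures on this single-line input
```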

And last but not least, if you need to match the shortest substring from one delimiter to another that does not contain these delimiters inside, use a tempered greedy token:

AA((?:(?!AA|ZZ).)*)ZZ|BB((?:(?!BB|YY).)*)YY
   ^^^^^^^^^^^^^^^       ^^^^^^^^^^^^^^^ 

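The difference from AA(.*?)ZZ|BB(.*?)YY is that the captured text can never contain another delimiter of the same pair, so the match starts at the leading delimiter closest to the trailing one.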
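
A small sketch of that difference, using a made-up input in which the leading delimiter AA appears twice before ZZ:

```python
import re

lazy = re.compile(r'AA(.*?)ZZ|BB(.*?)YY')
tempered = re.compile(r'AA((?:(?!AA|ZZ).)*)ZZ|BB((?:(?!BB|YY).)*)YY')

text = "AA one AA two ZZ"

# The lazy pattern starts at the first AA and runs across the second one.
lazy_match = lazy.search(text).group(1)

# The tempered pattern refuses to step over AA, so the match can only
# start at the second AA.
tempered_match = tempered.search(text).group(1)

print(repr(lazy_match))      # ' one AA two '
print(repr(tempered_match))  # ' two '
```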

Answered By: Wiktor Stribiżew

Try this:

AA(.*)AA|BB(.*)BB

Answered By: sunny

The question is confusing. From what I understood, you want it to match either AA..AA or BB..BB, but not AA..BB which it is currently matching. I’m awful with regex, but I think this should work:
Edit: Sorry, SE formatting messed it up.

(?:(AA(.*)AA)|(BB(.*)BB))

>>> import re
>>> data = ['AAsometextAA', 'BBothertextBB', 'NotMatched', 'AAalsonotmatchedBB']
>>> pattern = re.compile(r"(?:(AA(.*)AA)|(BB(.*)BB))")
>>> # keep only the strings that matched; re.match returns None otherwise
>>> matches = [m for m in map(pattern.match, data) if m is not None]
>>> for match in matches:
...     print(match.group(0))
...
AAsometextAA
BBothertextBB
>>>

Answered By: Goodies

This should work for you.

(AA(.*)AA)|(BB(.*)BB)

Answered By: Rahul.M