Python regex extract groups between different potential expressions

Question:

I’d like to extract groups that are between characters or sets of characters. The issue is that some characters in groups can also be in the sets of characters used to extract the groups. Here is what I mean:
My sentence is something like:
text = '[[aaa / bbb (T1=T2)] / [bbb (T1=T2) / bbb (T1>T2)]]'
At the end I’d like to get a list like:
['aaa', 'bbb (T1=T2)', 'bbb (T1=T2)', 'bbb (T1>T2)']
so I should ‘cut’ at \[*, \]* or \s/\s.
I’ve tried in Python re.findall(r'[\[*|\]*|\s/\s](.*?)[\[*|\]*|\s/\s]', text) but the output is ['', '', 'bbb', '', '', 'bbb', '', 'bbb', '']. I have tried many things actually, and of course I searched a lot on the internet before posting. Then, on https://regexr.com/ I realized that the pattern \s/\s was correctly detected, but as soon as I added the range characters [ and ] to do [\[*|\]*|\s/\s], all single spaces were also detected, because the range sees \s and says "ok let’s split at every space". That makes sense, but that’s not how I need to split my sentences. I’ve tried adding brackets or parentheses around \s/\s but that doesn’t work, in https://regexr.com/ or in Python.
Do you have an idea of how to include a set of characters of an expression in the possible patterns to extract groups?
Thanks a lot!

Asked By: Alexis Cllmb


Answers:

import re

text = '[[aaa / bbb (T1=T2)] / [bbb (T1=T2) / bbb (T1>T2)]]'
result = re.findall(r'\b\w+\b(?:\s*\([^()]*\))?', text)

print(result)
# Output: ['aaa', 'bbb (T1=T2)', 'bbb (T1=T2)', 'bbb (T1>T2)']

This regex pattern matches:

\b\w+\b – one or more word characters delimited by word boundaries (matching the leading word of each item, such as aaa or bbb)

(?:\s*\([^()]*\))? – an optional non-capturing group that matches zero or more whitespace characters, followed by a literal ( and ) enclosing zero or more non-parenthesis characters (matching any optional text in parentheses after the word)
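To make the breakdown above easier to follow, here is a minimal sketch of the same pattern written with re.VERBOSE so each part carries its own comment. The name ITEM is only an illustrative choice, and the sketch assumes every item is a single word optionally followed by one parenthesized part, as in the sample text.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE.
# ITEM is a hypothetical name chosen for this sketch.
ITEM = re.compile(
    r"""
    \b\w+\b           # a whole word, e.g. aaa or bbb
    (?:               # optional non-capturing group:
        \s*           #   any whitespace after the word
        \( [^()]* \)  #   a literal ( ... ) with no nested parentheses
    )?
    """,
    re.VERBOSE,
)

text = '[[aaa / bbb (T1=T2)] / [bbb (T1=T2) / bbb (T1>T2)]]'

# The pattern has no capturing groups, so findall returns the full matches.
result = ITEM.findall(text)
print(result)
```

Note that an item made of several words before the parentheses would be split into separate matches by this pattern.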

I hope this helps!

Answered By: 1aryo1

If the single square brackets should be part of the match, you could capture in a group what is in between and then use split.

  • The split pattern \s+/\s+ matches a forward slash between 1 or more whitespace characters.

  • The match pattern \[([^][]*)] matches [ till the first occurrence of ] using a negated character class and captures in group 1 what is in between.

For example

import re

text = '[[aaa / bbb (T1=T2)] / [bbb (T1=T2) / bbb (T1>T2)]]'
pattern = r"\[([^][]*)]"
res = []

for s in re.findall(pattern, text):
    res += re.split(r"\s+/\s+", s)

print(res)

Output

['aaa', 'bbb (T1=T2)', 'bbb (T1=T2)', 'bbb (T1>T2)']

See the group 1 match in this regex demo and a Python demo.
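As a side note on why the attempt in the question split at every space: inside a character class, \s and | are just individual set members, so the class matches any single whitespace character. Alternation outside a class treats \s+/\s+ as a whole sequence. The sketch below is one possible single-pass variant, assuming the only delimiters are runs of square brackets and a slash surrounded by whitespace; the name DELIMITER is illustrative and not part of the answer above.

```python
import re

text = '[[aaa / bbb (T1=T2)] / [bbb (T1=T2) / bbb (T1>T2)]]'

# Inside a character class, \s stands for "any one whitespace character",
# so every single space is matched on its own.
class_hits = re.findall(r'[\s/]', 'aaa / bbb (T1=T2)')
print(class_hits)

# With alternation, each branch is a full sub-pattern:
# a run of '[', a run of ']', or a slash surrounded by whitespace.
DELIMITER = re.compile(r'\[+|\]+|\s+/\s+')

# Adjacent delimiters produce empty strings, which are filtered out here.
parts = [p for p in DELIMITER.split(text) if p]
print(parts)
```

The spaces inside items such as bbb (T1=T2) are left alone because none of the three branches matches a lone space.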

Answered By: The fourth bird