Python regex extract groups between different potential expressions
Question:
I’d like to extract groups that are between characters or sets of characters. The issue is that some characters in groups can also be in the sets of characters used to extract the groups. Here is what I mean:
My sentence is something like:
text = '[[aaa / bbb (T1=T2)] / [bbb (T1=T2) / bbb (T1>T2)]]'
At the end I’d like to get a list like:
['aaa', 'bbb (T1=T2)', 'bbb (T1=T2)', 'bbb (T1>T2)']
so I should ‘cut’ at \[*, \]* or \s/\s.
I’ve tried in Python re.findall(r'[\[*|\]*|\s/\s](.*?)[\[*|\]*|\s/\s]', text) but the output is ['', '', 'bbb', '', '', 'bbb', '', 'bbb', '']. I have tried many things, and of course I searched a lot on the internet before posting. Then, on https://regexr.com/ I realized that the pattern \s/\s was correctly detected, but as soon as I added the range characters [ and ] to make [\[*|\]*|\s/\s], all single spaces were also detected, because the range sees \s and says "ok let’s split at every space". That makes sense, but that’s not how I need to split my sentences. I’ve tried adding brackets or parentheses around \s/\s but that doesn’t work, on https://regexr.com/ or in Python.
Do you have an idea of how to include a set of characters of an expression in the possible patterns to extract groups?
Thanks a lot!
Answers:
import re
text = '[[aaa / bbb (T1=T2)] / [bbb (T1=T2) / bbb (T1>T2)]]'
result = re.findall(r'\b\w+\b(?:\s*\([^()]*\))?', text)
print(result)
# Output: ['aaa', 'bbb (T1=T2)', 'bbb (T1=T2)', 'bbb (T1>T2)']
This regex pattern matches:
\b\w+\b – one or more word characters delimited by word boundaries on both sides (matching the first word of each item)
(?:\s*\([^()]*\))? – an optional non-capturing group that matches zero or more whitespace characters, followed by a literal ( and ) pair containing zero or more non-parenthesis characters (matching any optional text in parentheses after the word)
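To make the two parts of that pattern concrete, here is a minimal sketch that runs it on a few short inputs. The third input, with a two-word item, is a made-up example that is not in the question; it only illustrates that the pattern matches word by word, so it fits data shaped like the sample text and may not fit items that contain several words.

```python
import re

# Same pattern as above: a whole word, optionally followed by "(...)".
pattern = re.compile(r'\b\w+\b(?:\s*\([^()]*\))?')

# A word followed by a parenthesised part is returned as one match.
with_parens = pattern.findall('bbb (T1=T2)')

# A word on its own still matches, because the group is optional.
word_only = pattern.findall('aaa')

# Hypothetical input: a two-word item is split into separate words.
two_words = pattern.findall('aaa ccc / bbb (T1=T2)')

print(with_parens)
print(word_only)
print(two_words)
```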
I hope this helps!
If the single square brackets should be part of the match, you could capture in a group what is in between and then use split.
- The split pattern \s+/\s+ matches a forward slash between 1 or more whitespace characters.
- The match pattern \[([^][]*)] matches [ till the first occurrence of ] using a negated character class and captures in group 1 what is in between.
For example
import re
text = '[[aaa / bbb (T1=T2)] / [bbb (T1=T2) / bbb (T1>T2)]]'
pattern = r"\[([^][]*)]"
res = []
for s in re.findall(pattern, text):
    res += re.split(r"\s+/\s+", s)
print(res)
Output
['aaa', 'bbb (T1=T2)', 'bbb (T1=T2)', 'bbb (T1>T2)']
See the group 1 match in this regex demo and a Python demo.
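On the original question of why adding [ and ] made every space match: inside a character class each character is its own alternative, so \s/\s is no longer read as a three-character sequence. A possible alternative, sketched here and not taken from either answer above, is to put the sequence and the brackets in an alternation and split on that. The names `delimiter` and `parts` are only illustrative.

```python
import re

text = '[[aaa / bbb (T1=T2)] / [bbb (T1=T2) / bbb (T1>T2)]]'

# Inside [...] the \s stands alone, so a single space is a match.
char_class_hits = re.findall(r'[\[\]|\s/\s]', 'a b')

# Alternation keeps " / " as one delimiter next to the bracket characters;
# the trailing + merges adjacent delimiters such as "] / [".
delimiter = r'(?:\s/\s|[\[\]])+'

# re.split leaves empty strings at the ends, so they are filtered out.
parts = [p for p in re.split(delimiter, text) if p]

print(char_class_hits)
print(parts)
```

With the sample text this yields the four items asked for in the question. Spaces inside an item, such as the one before (T1=T2), are left alone because a space only counts as a delimiter when it surrounds a slash.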