Repeat entire group 0 or more times (one or more words separated by +'s)
Question:
I am trying to match words separated with the + character as input from a user in Python and check if each of the words is in a predetermined list. I am having trouble creating a regular expression to match these words (words are comprised of more than one A-z characters). For example, an input string foo should match, as well as foo+bar and foo+bar+baz, with each of the words (not the +'s) being captured.
So far, I have tried a few regular expressions but the closest I have got is this:
/^([A-z+]+)\+([A-z+]+)$/
However, this only matches the case in which there are two words separated with a +; I need there to be one or more words. My method above would have worked if I could somehow repeat the second group (\+([A-z+]+)) zero or more times. Hence my question: how can I repeat a capturing group zero or more times?
If there is a better way to do what I am doing, please let me know.
Answers:
You could write the pattern as:
(?i)[A-Z]+(?:\+[A-Z]+)*$
Explanation
(?i)
Inline modifier for case-insensitive matching
[A-Z]+
Match 1+ chars A-Z
(?:\+[A-Z]+)*
Optionally repeat matching a literal + and again 1+ chars A-Z
$
End of string
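On the literal question of repeating a capturing group: in Python's re module a quantified capturing group keeps only its last iteration, which is presumably why the pattern above uses a non-capturing group and leaves the splitting to the caller. A minimal sketch of both behaviours follows; the variable names and the sample string are illustrative and not part of the answer.

```python
import re

# A repeated capturing group only remembers its final repetition.
capture_demo = re.fullmatch(r"([A-Za-z]+)(?:\+([A-Za-z]+))*", "foo+bar+baz")
last_only = capture_demo.groups()  # group 2 holds just the last word

# Validate the whole string first, then split to recover every word.
validated = re.fullmatch(r"(?i)[A-Z]+(?:\+[A-Z]+)*", "foo+bar+baz")
words = validated.group().split("+") if validated else []

print(last_only)
print(words)
```

So the words are best collected after validation (with split or findall) rather than through the repeated group itself.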
See a regex101 demo for the matches:
For example
import re

predeterminedList = ["foo", "bar"]
strings = ["foo", "foo+bar", "foo+bar+baz", "test+abc"]
# The literal plus sign must be escaped (\+), otherwise it is parsed as a quantifier
pattern = r"(?i)[A-Z]+(?:\+[A-Z]+)*$"

for s in strings:
    m = re.match(pattern, s)
    if m:
        # The string is well-formed, so split it into its words
        words = m.group().split("+")
        # True if at least one word appears in the predetermined list
        intersect = bool(set(words) & set(predeterminedList))
        fmt = ','.join(predeterminedList)
        if intersect:
            print(f"'{s}' contains at least one of '{fmt}'")
        else:
            print(f"'{s}' contains none of '{fmt}'")
Another option could be to create a dynamic pattern listing the alternatives:
(?i)^(?:[A-Z]+\+)*(?:foo|bar)(?:\+[A-Z]+)*$
Example
import re

predeterminedList = ["foo", "bar"]
strings = ["foo", "foo+bar", "foo+bar+baz", "test+abc"]
# Build the alternation (foo|bar) from the list; literal plus signs are escaped as \+
pattern = rf"(?i)^(?:[A-Z]+\+)*(?:{'|'.join(predeterminedList)})(?:\+[A-Z]+)*$"

for s in strings:
    m = re.match(pattern, s)
    fmt = ','.join(predeterminedList)
    # A match means at least one listed word occurs as a whole word
    if m:
        print(f"'{s}' contains at least one of '{fmt}'")
    else:
        print(f"'{s}' contains none of '{fmt}'")
Both will output:
'foo' contains at least one of 'foo,bar'
'foo+bar' contains at least one of 'foo,bar'
'foo+bar+baz' contains at least one of 'foo,bar'
'test+abc' contains none of 'foo,bar'
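The question asks whether each of the words is in the list, whereas the two examples above report whether at least one is. A hedged sketch of the stricter check follows, assuming a case-sensitive comparison against the list; the helper name all_words_allowed and the extra test string foo++bar are hypothetical additions.

```python
import re

predeterminedList = ["foo", "bar"]
# One or more alphabetic words separated by single literal plus signs
WORDS = re.compile(r"[A-Za-z]+(?:\+[A-Za-z]+)*")

def all_words_allowed(s, allowed):
    """Return True if s is well-formed and every word is in allowed."""
    if not WORDS.fullmatch(s):
        return False
    # Subset test: every word of the input must appear in the list
    return set(s.split("+")) <= set(allowed)

for s in ["foo", "foo+bar", "foo+bar+baz", "test+abc", "foo++bar"]:
    print(s, all_words_allowed(s, predeterminedList))
```

With this check foo and foo+bar pass, while foo+bar+baz fails because baz is not in the list.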
I would recommend a slightly different approach using lookarounds:
Pattern: (?<=^|\+)(?=foo|baz)[^+]+
Pattern explanation:
(?<=^|\+)
- positive lookbehind - assert that the preceding text is either ^ (beginning of string) or + (our 'word delimiter').
(?=foo|baz)
- positive lookahead - assert that the following text matches one of the words (from the predefined list)
[^+]+
- match one or more characters other than +
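One caveat for Python users: the built-in re module only accepts fixed-width lookbehinds, so a lookbehind that alternates between ^ and a literal + is rejected with "look-behind requires fixed-width pattern". A possible adaptation, offered as a sketch and not as the answerer's own method, swaps it for the negative lookbehind (?<![^+]), which succeeds at the start of the string or directly after a +. The trailing (?![^+]) is an added assumption so that only whole words from the list match, not words that merely start with one.

```python
import re

predeterminedList = ["foo", "baz"]
alternatives = "|".join(map(re.escape, predeterminedList))

# (?<![^+]) : not preceded by a non-plus character (start of string or after +)
# (?![^+])  : not followed by a non-plus character (end of string or before +)
pattern = re.compile(rf"(?<![^+])(?:{alternatives})(?![^+])")

found = pattern.findall("foo+bar+baz")
partial = pattern.findall("foobar+baz")
print(found)
print(partial)
```

Here findall returns only the listed words that occur as complete words, so foobar is skipped in the second string.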