Repeat entire group 0 or more times (one or more words separated by +'s)

Question:

I am trying to match words separated by the + character in user input in Python and check whether each of the words is in a predetermined list. I am having trouble creating a regular expression to match these words (words consist of one or more A-z characters). For example, the input string foo should match, as should foo+bar and foo+bar+baz, with each of the words (not the +'s) being captured.

So far, I have tried a few regular expressions but the closest I have got is this:

/^([A-z+]+)\+([A-z+]+)$/

However, this only matches the case in which there are exactly two words separated by a +, whereas I need it to match one or more words. My approach above would work if I could somehow repeat the second group (\+([A-z+]+)) zero or more times. Hence my question: how can I repeat a capturing group zero or more times?

If there is a better way to do what I am doing, please let me know.

Asked By: Nasser Kessas


Answers:

You could write the pattern as:

(?i)[A-Z]+(?:\+[A-Z]+)*$

Explanation

  • (?i) Inline modifier for case insensitive
  • [A-Z]+ Match 1+ chars A-Z
  • (?:\+[A-Z]+)* Optionally repeat matching a literal + and again 1+ chars A-Z
  • $ End of string
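
As a side note on the original question about repeating a capturing group: in Python's re module a repeated capturing group keeps only its last repetition, which is one reason to validate the whole string and then extract the words separately. The sketch below illustrates this; the helper name split_words is hypothetical and not part of the answer above.

```python
import re

# A repeated capturing group only remembers its final repetition.
m = re.fullmatch(r"([A-Za-z]+)(?:\+([A-Za-z]+))*", "foo+bar+baz")
print(m.groups())  # ('foo', 'baz'): 'bar' is not retained


def split_words(text):
    """Return the words if text has the shape word(+word)*, else None.

    Hypothetical helper: validate the overall shape first, then pull
    out every word with a second, simpler pattern.
    """
    if re.fullmatch(r"(?i)[A-Z]+(?:\+[A-Z]+)*", text) is None:
        return None
    return re.findall(r"[A-Za-z]+", text)
```

Here split_words("foo+bar+baz") yields all three words, while malformed input such as "foo+" or "foo++bar" is rejected.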

For example

import re

predeterminedList = ["foo", "bar"]
strings = ["foo", "foo+bar", "foo+bar+baz", "test+abc"]
pattern = r"(?i)[A-Z]+(?:\+[A-Z]+)*$"

for s in strings:
    m = re.match(pattern, s)
    if m:
        words = m.group().split("+")
        intersect = bool(set(words) & set(predeterminedList))
        fmt = ','.join(predeterminedList)
        if intersect:
            print(f"'{s}' contains at least one of '{fmt}'")
        else:
            print(f"'{s}' contains none of '{fmt}'")
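
The loop above reports whether at least one word is in the list. The question asks whether each word is in the list, so a variant along the following lines may be closer to that goal. This is only a sketch: the names ALLOWED and all_words_allowed are assumptions, and lowercasing the input is one possible way to keep the comparison case-insensitive.

```python
import re

ALLOWED = {"foo", "bar"}  # hypothetical predetermined list, stored in lowercase
WORDS = re.compile(r"[A-Za-z]+(?:\+[A-Za-z]+)*")


def all_words_allowed(text, allowed=ALLOWED):
    """Return True only if text is word(+word)* and every word is allowed."""
    # fullmatch anchors at both ends, so no explicit ^ or $ is needed.
    if WORDS.fullmatch(text) is None:
        return False
    return all(word.lower() in allowed for word in text.split("+"))
```

With this helper, "foo+bar" passes, "foo+bar+baz" fails because baz is not in the set, and "foo+" fails the shape check.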

Another option is to create a dynamic pattern listing the alternatives:

(?i)^(?:[A-Z]+\+)*(?:foo|bar)(?:\+[A-Z]+)*$

Example

import re

predeterminedList = ["foo", "bar"]
strings = ["foo", "foo+bar", "foo+bar+baz", "test+abc"]
pattern = rf"(?i)^(?:[A-Z]+\+)*(?:{'|'.join(predeterminedList)})(?:\+[A-Z]+)*$"

for s in strings:
    m = re.match(pattern, s)
    fmt = ','.join(predeterminedList)
    if m:
        print(f"'{s}' contains at least one of '{fmt}'")
    else:
        print(f"'{s}' contains none of '{fmt}'")

Both will output:

'foo' contains at least one of 'foo,bar'
'foo+bar' contains at least one of 'foo,bar'
'foo+bar+baz' contains at least one of 'foo,bar'
'test+abc' contains none of 'foo,bar'
Answered By: The fourth bird

I would recommend a slightly different approach using lookarounds:

Pattern: (?<=^|\+)(?=foo|baz)[^+]+

Pattern explanation:

(?<=^|\+) : positive lookbehind, asserts that the preceding position is either ^ (beginning of string) or a + (our 'word delimiter').

(?=foo|baz) : positive lookahead, asserts that the following text matches one of the words (from the predefined list).

[^+]+ : match one or more characters other than +
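
Python's built-in re module rejects lookbehinds whose alternatives differ in width, so a lookbehind that alternates between ^ and a one-character delimiter will not compile there. The sketch below swaps in a different but related technique: fixed-width negative lookarounds that require each match to be bounded by a + or by the string edges. Unlike the pattern above, it matches only whole words from the list rather than any word that starts with one. The function name find_listed_words is hypothetical.

```python
import re


def find_listed_words(text, words):
    """Return every +-delimited word in text that appears in words."""
    alternatives = "|".join(re.escape(w) for w in words)
    # (?<![^+]) : not preceded by a non-+ character (string start or a +)
    # (?![^+])  : not followed by a non-+ character (string end or a +)
    pattern = rf"(?<![^+])(?:{alternatives})(?![^+])"
    return re.findall(pattern, text)
```

For example, find_listed_words("foo+bar+baz", ["foo", "baz"]) picks out foo and baz, while "foobar+baz" yields only baz because foobar is not a whole listed word.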

Answered By: Michał Turczyn