How to match a group of words that appear consecutively in any order?

Question:

Say I have a group (or set) of words: {foo, bar, baz}, and I want to match and extract the group. The words can be in any order, but they need to be next to each other. For example,

hello foo bar baz wow yeah => foo bar baz
hello bar foo baz wow yeah => bar foo baz
wow yeah hello baz bar foo hello => baz bar foo
baz yeah bar foo hello => no match

What would be a good regex, preferably in Python, to accomplish this?

Asked By: yukw777


Answers:

If each word can only appear once (in the entire string), you may use:

(?:\b(foo|bar|baz)(?!.*\b\1\b) ){3}

Demo.
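
As a quick check of the single-occurrence pattern, here is a minimal Python sketch that runs it over the question's four sample lines. The names `samples` and `extract` are illustrative and not part of the answer. Note that the pattern as written consumes a space after each of the three words, so the sketch strips the trailing space, and a group sitting at the very end of a string would not match.

```python
import re

# Single-occurrence pattern: after each word is captured, the negative
# lookahead rejects it if the same word reappears later in the string.
pattern = re.compile(r"(?:\b(foo|bar|baz)(?!.*\b\1\b) ){3}")


def extract(line):
    """Return the matched group of words, or None (hypothetical helper)."""
    m = pattern.search(line)
    # The match includes the space after the third word, so strip it.
    return m.group().rstrip() if m else None


samples = [
    "hello foo bar baz wow yeah",
    "hello bar foo baz wow yeah",
    "wow yeah hello baz bar foo hello",
    "baz yeah bar foo hello",
]

for line in samples:
    print(extract(line))
```

A line such as "foo bar baz then foo again" yields None with this pattern, because foo reappears later in the string; that is the restriction stated above.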

If words might repeat, I don’t think you can get any shorter than something like this:*

\b(foo|bar|baz) (?!\1)(foo|bar|baz) (?!\1|\2)(foo|bar|baz)\b

Demo.

Details:

  • \b – Word boundary.
  • (foo|bar|baz) – Match any of the specified words and capture it in group 1.
  • (?!\1), preceded by a literal space – A space character not immediately followed by the word captured in group 1.
  • (foo|bar|baz) – Match any of the specified words and capture it in group 2.
  • (?!\1|\2), preceded by a literal space – A space character not immediately followed by any of the words previously captured.
  • (foo|bar|baz) – Match any of the specified words and capture it in group 3.
  • \b – Word boundary.

Note: The third occurrence of foo|bar|baz can be used without a capturing group (i.e., in a non-capturing group) but I left it there for consistency.

Python example:

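import re

# \b anchors the group at word boundaries; each negative lookahead
# rejects a word that an earlier group has already captured.
regex = r"\b(foo|bar|baz) (?!\1)(foo|bar|baz) (?!\1|\2)(foo|bar|baz)\b"
test_str = """hello foo bar baz wow yeah
hello bar foo baz wow yeah
wow yeah hello baz bar foo hello
baz yeah bar foo hello"""

# finditer yields one match object per group of three words found.
matches = re.finditer(regex, test_str, re.MULTILINE)
for match in matches:
    print(match.group())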

Output:

foo bar baz
bar foo baz
baz bar foo

Try it online.


* We can actually use a slightly shorter pattern for this specific case: \b(foo|ba[rz]) (?!\1)(foo|ba[rz]) (?!\1|\2)(foo|ba[rz])\b but that wouldn’t generalize to an arbitrary set of 3 words.
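
The general form of the pattern extends mechanically to any number of words. The following is a sketch rather than part of the original answer: `build_any_order_pattern` is a hypothetical helper that emits one capturing group per word, each preceded by a negative lookahead listing all earlier groups. It assumes the words contain no spaces and that no word is a prefix of another (a lookahead such as (?!\1) would otherwise also reject the longer word).

```python
import re


def build_any_order_pattern(words):
    """Build a regex matching all `words`, space-separated, in any order.

    Hypothetical helper; assumes no word is a prefix of another.
    """
    alternation = "|".join(re.escape(w) for w in words)
    parts = []
    for i in range(len(words)):
        group = "(" + alternation + ")"
        if i > 0:
            # Forbid repeating any word captured by groups 1..i.
            backrefs = "|".join("\\" + str(j) for j in range(1, i + 1))
            group = "(?!" + backrefs + ")" + group
        parts.append(group)
    return r"\b" + " ".join(parts) + r"\b"


pattern = build_any_order_pattern(["foo", "bar", "baz"])
print(pattern)
print(re.search(pattern, "hello bar foo baz wow yeah").group())
```

For the three words in the question, the helper produces the same pattern as the one given in the answer; for longer word lists the pattern grows by one group and one lookahead per word.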

You can use a positive lookahead to capture 3 consecutive words (group 1) together with everything after them (group 2), and then use one more lookahead per desired word to assert that the word occurs and is still followed by what was captured in group 2:

(?=(\w+ \w+ \w+)(.*))(?=.*\bfoo\b.*\2)(?=.*\bbar\b.*\2)(?=.*\bbaz\b.*\2)

Each match can then be found in group #1.

Demo: https://regex101.com/r/0tHKN5/2
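
Because this pattern consists only of lookaheads, the overall match is empty and the words have to be read from group 1. A minimal Python sketch over the question's sample lines follows; the `samples` name and the split of the pattern across several string literals are only for illustration.

```python
import re

pattern = re.compile(
    r"(?=(\w+ \w+ \w+)(.*))"  # group 1: three words, group 2: rest of the line
    r"(?=.*\bfoo\b.*\2)"      # foo occurs before the captured rest
    r"(?=.*\bbar\b.*\2)"      # bar occurs before the captured rest
    r"(?=.*\bbaz\b.*\2)"      # baz occurs before the captured rest
)

samples = [
    "hello foo bar baz wow yeah",
    "hello bar foo baz wow yeah",
    "wow yeah hello baz bar foo hello",
    "baz yeah bar foo hello",
]

for line in samples:
    # m.group() is the empty string here; the words are in group 1.
    print([m.group(1) for m in pattern.finditer(line)])
```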

EDIT: According to regex101, performance improves from 5490 to 1377 steps by adding a word boundary assertion at the start and by allowing at most 2 words around each keyword, instead of scanning to the end of the line with .*:

(?=(\b\w+ \w+ \w+)(.*))(?=(?:\w+ ){,2}\bfoo\b(?: \w+){,2}\2)(?=(?:\w+ ){,2}\bbar\b(?: \w+){,2}\2)(?=(?:\w+ ){,2}\bbaz\b(?: \w+){,2}\2)

Demo: https://regex101.com/r/0tHKN5/3
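
One way to gain some confidence that the bounded variant behaves like the .* version is to run both over the same input and compare what lands in group 1. The sketch below does this for the question's sample lines only, so it is a sanity check rather than a proof of equivalence; the names `unbounded`, `bounded`, and `groups` are illustrative. Python's re module accepts {,2} as shorthand for {0,2}.

```python
import re

unbounded = re.compile(
    r"(?=(\w+ \w+ \w+)(.*))"
    r"(?=.*\bfoo\b.*\2)(?=.*\bbar\b.*\2)(?=.*\bbaz\b.*\2)"
)
bounded = re.compile(
    r"(?=(\b\w+ \w+ \w+)(.*))"
    r"(?=(?:\w+ ){,2}\bfoo\b(?: \w+){,2}\2)"
    r"(?=(?:\w+ ){,2}\bbar\b(?: \w+){,2}\2)"
    r"(?=(?:\w+ ){,2}\bbaz\b(?: \w+){,2}\2)"
)


def groups(pattern, line):
    """Collect group 1 of every (empty) match in the line."""
    return [m.group(1) for m in pattern.finditer(line)]


samples = [
    "hello foo bar baz wow yeah",
    "hello bar foo baz wow yeah",
    "wow yeah hello baz bar foo hello",
    "baz yeah bar foo hello",
]

for line in samples:
    # Print whether both patterns agree, then what the bounded one found.
    print(groups(unbounded, line) == groups(bounded, line), groups(bounded, line))
```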

Answered By: blhsing