Regular expression to capture n words after pattern that do not contain that pattern

Question:

I’m trying to write a regular expression that captures n words after a pattern, which was answered in this question, except I want the search to keep going for another n words if it encounters that pattern again. For example, if my main search pattern is ‘x’, and I want to capture a word that contains ‘x’ and n=3 words after it that don’t contain ‘x’, the following string should result in three matches:

Lorem ipsum dolxor sit amet, consectetur adipiscing elit. Morxbi fringilla, dui axt tincidunt consectetur, libero arcu cursus arcxu, ut commodo lexctus magna vitxae venenatis neque.

Matches:

  1. dolxor sit amet, consectetur
  2. Morxbi fringilla, dui axt tincidunt consectetur, libero
  3. arcxu, ut commodo lexctus magna vitxae venenatis neque.

Matching n=3 words after is straightforward: [^ ]*x[^ ]*(?: [^ ]*){0,3}

How to keep going if another ‘x’ is encountered, I’m not sure. I’ve tried [^ ]*x[^ ]*(?: (?![^ ]*x[^ ]*)[^ ]*){0,3}, but it terminates the search instead of continuing on to check the next n words, which, given the example above, gives six results instead of the expected three:

  1. dolxor sit amet, consectetur
  2. Morxbi fringilla, dui
  3. axt tincidunt consectetur, libero
  4. arcxu, ut commodo
  5. lexctus magna
  6. vitxae venenatis neque.
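
For reference, the behaviour of the two attempts above can be reproduced with a short Python sketch. The variable names are illustrative and not part of the original post; only `re.findall` from the standard library is used.

```python
import re

text = (
    "Lorem ipsum dolxor sit amet, consectetur adipiscing elit. "
    "Morxbi fringilla, dui axt tincidunt consectetur, libero arcu cursus "
    "arcxu, ut commodo lexctus magna vitxae venenatis neque."
)

# First attempt: a word containing 'x' plus up to 3 following words.
plain = r"[^ ]*x[^ ]*(?: [^ ]*){0,3}"
# Second attempt: the following words must not contain 'x' themselves.
guarded = r"[^ ]*x[^ ]*(?: (?![^ ]*x[^ ]*)[^ ]*){0,3}"

plain_matches = re.findall(plain, text)
guarded_matches = re.findall(guarded, text)

# Print the six pieces produced by the second attempt.
for piece in guarded_matches:
    print(piece)
```

The first pattern swallows ‘axt’ as context of ‘Morxbi’, while the second splits the text at every new ‘x’ word, which is the behaviour described above.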

P.S. I’m working with python.


EDIT: For context, I’m trying to get sufficient information about the surroundings of each appearance of the given pattern. (For simplicity’s sake, I’m only including the words after the pattern, but it’s easy to generalize to words before it.) The problem with the first regex is that a word may lack information on its own surroundings if it gets picked up as part of the surroundings of another match. For example, the second match given the text above would be ‘Morxbi fringilla, dui axt’, which gives us information about what comes after ‘Morxbi’ but not ‘axt’. The second regex doesn’t help because now a match with another match in its surroundings will lose that information, e.g., we won’t know the third word that comes after ‘Morxbi’.

Asked By: Zachary

Answers:

It’s not clear that regex is the most natural way to solve your use case.
Consider this hybrid approach.

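import re
from typing import Optional

pattern = re.compile(r"x")  # or whatever

def get_at_least_n(text: str, n=3) -> Optional[range]:
    """Return the word-index range of the longest run, or None if nothing matches."""
    words = text.split()
    matches = [bool(pattern.search(word)) for word in words]
    if not any(matches):
        return None
    # Tuples sort by length first, so the last one is the longest run.
    _, i, j = sorted(_get_ranges(matches, n))[-1]
    return range(i, j)

def _get_ranges(matches, n):
    """Yield (length, start, stop) for the run that starts at each matching word."""
    for i in range(len(matches)):
        if matches[i]:
            j = i + n  # last word of the current window
            k = i + 1
            # A match inside the window, or right after it, extends the run.
            while k < len(matches) and k <= j + 1:
                if matches[k]:
                    j = k + n
                k += 1
            stop = min(j + 1, len(matches))  # exclusive end, clipped to the text
            yield stop - i, i, stop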

The regular expression engine loops over characters.
Python’s re module supports lookarounds and backreferences
but not recursive patterns, so it cannot handle CFGs in general,
and it is not Turing complete.
For one thing, evaluation is guaranteed to always halt.
(Cheating a bit, wrapping it in a loop such as sed offers
would enable Turing completeness and even Conway’s Life: https://bitly.com/regexgol)

Here, the longest match you’re looking for could appear
anywhere in the input, such as

A B x D E x G H x J K L M N O x Q R x T

or

A B x D E F G H I J K L M N O x Q R x T

so with n = 2 the correct answer for (1.) is "x D E x G H x J K"
and for (2.) is "x Q R x T".
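
As a rough illustration of the two inputs above, here is a minimal single-pass sketch. The names `merged_runs` and `longest_run` are hypothetical, and it rests on two assumptions: each match is followed by n = 2 further words, and a match sitting inside that window or immediately after it extends the run.

```python
import re

def merged_runs(words, n, pattern=re.compile(r"x")):
    """Yield (start, stop) word-index pairs, stop exclusive, one per merged run."""
    start = None  # first word of the run in progress, if any
    last = None   # most recent matching word inside that run
    for i, word in enumerate(words):
        hit = pattern.search(word) is not None
        # The window after the last match is used up and this word
        # does not match, so the run ends just before this word.
        if start is not None and not hit and i > last + n:
            yield start, i
            start = None
        if hit:
            if start is None:
                start = i
            last = i
    if start is not None:
        # The text ended while a run was open: clip the window to the text.
        yield start, min(last + n + 1, len(words))

def longest_run(text, n):
    """Return the longest merged run as a string, or None if nothing matches."""
    words = text.split()
    runs = list(merged_runs(words, n))
    if not runs:
        return None
    start, stop = max(runs, key=lambda run: run[1] - run[0])
    return " ".join(words[start:stop])

print(longest_run("A B x D E x G H x J K L M N O x Q R x T", 2))
print(longest_run("A B x D E F G H I J K L M N O x Q R x T", 2))
```

Because each word is visited once, this variant does not rescan overlapping runs.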

Computing the longest match using just the regex engine
is not straightforward,
and it would yield illegible, unmaintainable code.

Given that you have CPython’s Turing machine available
to you, hey, may as well use it, right?

time complexity

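Let m denote the number of input words,
which also bounds the number of matches.

When matches are isolated, the inner loop runs about n times
for each of them, so it executes at most roughly m × n times.

So depending on whether you view n as constant,
we have either O(m × n) or O(m).
When matches are dense, runs overlap and each starting match
rescans its own run, so the worst case grows to O(m²).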

In contrast, "short" regular expressions with multiple
levels of backtracking
can easily
take cubic time or worse.

Answered By: J_H

Turns out the solution was a lot closer than I thought! [^ ]*x[^ ]*(?: (?![^ ]*x[^ ]*)[^ ]*){0,3} already captures the appropriate strings; the only problem is that it cuts them off instead of joining them (e.g., ‘Morxbi fringilla, dui’ and ‘axt tincidunt consectetur, libero’ instead of ‘Morxbi fringilla, dui axt tincidunt consectetur, libero’). The solution is simply to add a + to the overall expression, accounting for the spaces between the pieces, to join them: (?:[^ ]*x[^ ]*(?: (?![^ ]*x[^ ]*)[^ ]*){0,3} ?)+(?<! ) (example). The optional space lets consecutive pieces join, and the trailing (?<! ) keeps the match from ending on a space.
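
A quick way to check the joined expression against the sample text from the question is the sketch below. The variable names are illustrative; the pattern itself is copied from this answer.

```python
import re

text = (
    "Lorem ipsum dolxor sit amet, consectetur adipiscing elit. "
    "Morxbi fringilla, dui axt tincidunt consectetur, libero arcu cursus "
    "arcxu, ut commodo lexctus magna vitxae venenatis neque."
)

# Each piece is an 'x' word plus up to 3 following words without 'x';
# the + joins consecutive pieces into one match.
joined = re.compile(r"(?:[^ ]*x[^ ]*(?: (?![^ ]*x[^ ]*)[^ ]*){0,3} ?)+(?<! )")

matches = [m.group(0) for m in joined.finditer(text)]
for number, match in enumerate(matches, start=1):
    print(f"{number}. {match}")
```

On the sample text this yields the three matches listed in the question.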

This solution can also be extended to looking for n words before and m words after the pattern: (?:(?:[^ ]* ){0,n}[^ ]*x[^ ]*(?: (?![^ ]*x[^ ]*)[^ ]*){0,m})+ (example with n=2 and m=3).
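
Since {0,n} and {0,m} are placeholders rather than literal regex syntax, the pattern has to be built with concrete numbers. A minimal sketch of one way to do that follows; the helper name `context_pattern` and its parameters are hypothetical, not something from the original answer.

```python
import re

def context_pattern(token, before, after):
    """Build the joined regex with `before` words ahead of and `after` words behind each match."""
    tok = re.escape(token)
    word_with_token = rf"[^ ]*{tok}[^ ]*"
    return re.compile(
        rf"(?:(?:[^ ]* ){{0,{before}}}{word_with_token}"
        rf"(?: (?!{word_with_token})[^ ]*){{0,{after}}})+"
    )

text = (
    "Lorem ipsum dolxor sit amet, consectetur adipiscing elit. "
    "Morxbi fringilla, dui axt tincidunt consectetur, libero arcu cursus "
    "arcxu, ut commodo lexctus magna vitxae venenatis neque."
)

# n=2 words before and m=3 words after, as in the example mentioned above.
pattern = context_pattern("x", before=2, after=3)
matches = [m.group(0) for m in pattern.finditer(text)]
for match in matches:
    print(match)
```

With these values each match carries up to two leading words of context as well as the trailing ones.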

Thanks to @bobblebubble for making this suggestion in the comments.

Answered By: Zachary