Python regex: Explain why expression not matching

Question:

I am using the regex library to find words that sit between specific other words. For example, I want to match "world" if and only if a greeting precedes it and punctuation follows it. To avoid matching word prefixes and suffixes, I added the extra condition [^a-zA-Z]. However, once I add it, regex no longer matches the word:

>>> import regex

>>> pat = regex.compile(r"(?<=[^a-zA-Z](hello|hi)\s+)world(?=\s*[!?.][^a-zA-Z])")

>>> list(pat.finditer("hello world!"))
[]

>>> pat = regex.compile(r"(?<=\b(hello|hi)\s+)world(?=\s*[!?.]\b)")

>>> list(pat.finditer("hello world!"))
[]

>>> pat = regex.compile(r"(?<=(hello|hi)\s+)world(?=\s*[!?.])")

>>> list(pat.finditer("hello world!"))
[<regex.Match object; span=(6, 11), match='world'>]

How can this be explained? How can I make sure that only whole words are matched inside the lookahead and lookbehind?

Asked By: Green绿色

||

Answers:

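As correctly mentioned by @Michael, the width was the problem: [^a-zA-Z] has to match one actual character, so it fails at the start and end of the string. Allowing ^ and $ as alternatives does the trick: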

>>> import regex

>>> pat = regex.compile(r"(?<=([^a-zA-Z]|^)(hello|hi)\s+)world(?=\s*[!?.]($|[^a-zA-Z]))")

>>> list(pat.finditer("hello world!"))
[<regex.Match object; span=(6, 11), match='world'>]

>>> list(pat.finditer("hello world!x"))
[]

>>> list(pat.finditer("xhello world!"))
[]
Answered By: Green绿色

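The reason is that whatever you put inside (?<= and (?= must actually be present to the left and to the right. [^a-zA-Z] matches exactly one character, so it cannot succeed at the start or end of the string, where there is no character to match.

Note that \b after [!?.] only matches when a word character follows the punctuation. At the end of the string, or before whitespace, there is no word boundary at that position.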
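Both effects can be checked in isolation. The following is a minimal sketch that uses the standard-library re module instead of regex, which is enough here because no variable-width lookbehind is involved; the variable names are only illustrative.

```python
import re  # standard library; no lookbehind is needed for these checks

text = "hello world!"

# [^a-zA-Z] has to consume one character, so it fails at the start of the string.
needs_char = re.search(r"[^a-zA-Z]hello", text)
# Allowing the start-of-string anchor as an alternative fixes that.
char_or_start = re.search(r"(?:[^a-zA-Z]|^)hello", text)

# \b after "!" only holds when a word character follows the punctuation.
boundary_at_end = re.search(r"world!\b", text)
boundary_before_x = re.search(r"world!\b", "hello world!x")
# (?!\S) holds at the end of the string or before whitespace.
ws_boundary = re.search(r"world!(?!\S)", text)

print(needs_char, boundary_at_end)                      # both are None
print(char_or_start, boundary_before_x, ws_boundary)    # all three are match objects
```

Note that \b succeeds on "hello world!x", which is exactly the input that should be rejected, while (?!\S) succeeds on the plain "hello world!".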

You could write the pattern as:

(?<=\b(?:hello|hi)\s+)world(?=\s*[!?.](?!\S))

Explanation

  • (?<= Positive lookbehind, assert that to the left is
    • \b(?:hello|hi)\s+ A word boundary, then match either the word hello or hi and 1+ whitespace chars
  • ) Close the lookbehind
  • world Match literally
  • (?= Positive lookahead, assert that to the right is
    • \s*[!?.] Match optional whitespace chars and one of ! ? .
    • (?!\S) Assert a whitespace boundary to the right
  • ) Close the lookahead

Or asserting a whitespace boundary to the left instead of the word boundary:

(?<=(?<!\S)(?:hello|hi)\s+)world(?=\s*[!?.](?!\S))
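If the third-party regex module is not available, a similar check can be sketched with the standard-library re module. re rejects variable-width lookbehinds, so this variant swaps the lookbehind for a normal match of the greeting and captures world in group 1. The names GREETED_WORLD and find_world_spans are hypothetical and not part of either library.

```python
import re

# The greeting is consumed instead of asserted, and "world" is captured in group 1.
# (?<!\S) is a fixed-width lookbehind, which re does support.
GREETED_WORLD = re.compile(r"(?<!\S)(?:hello|hi)\s+(world)(?=\s*[!?.](?!\S))")

def find_world_spans(text):
    """Return the (start, end) span of every 'world' that follows a greeting."""
    return [m.span(1) for m in GREETED_WORLD.finditer(text)]

print(find_world_spans("hello world!"))   # [(6, 11)]
print(find_world_spans("hello world!x"))  # []
print(find_world_spans("xhello world!"))  # []
```

Because the greeting is consumed here, the overall match is longer than with the lookbehind version; only the group 1 span corresponds to the word itself.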


Answered By: The fourth bird