Python regex: Explain why expression not matching
Question:
I am using the regex library to find words that are in between specific other words. For example, I want to match "world" if and only if a greeting precedes it and punctuation follows. To avoid matching word prefixes and suffixes, I added the additional condition [^a-zA-Z]. However, once I add these, regex cannot match the word anymore:
>>> import regex
>>> pat = regex.compile(r"(?<=[^a-zA-Z](hello|hi)\s+)world(?=\s*[!?.][^a-zA-Z])")
>>> list(pat.finditer("hello world!"))
[]
>>> pat = regex.compile(r"(?<=\b(hello|hi)\s+)world(?=\s*[!?.]\b)")
>>> list(pat.finditer("hello world!"))
[]
>>> pat = regex.compile(r"(?<=(hello|hi)\s+)world(?=\s*[!?.])")
>>> list(pat.finditer("hello world!"))
[<regex.Match object; span=(6, 11), match='world'>]
How can this be explained? How can I make sure to match whole words in the lookahead and lookbehind sections?
Answers:
As correctly mentioned by @Michael, the width was the problem. The following does the trick:
>>> import regex
>>> pat = regex.compile(r"(?<=([^a-zA-Z]|^)(hello|hi)\s+)world(?=\s*[!?.]($|[^a-zA-Z]))")
>>> list(pat.finditer("hello world!"))
[<regex.Match object; span=(6, 11), match='world'>]
>>> list(pat.finditer("hello world!x"))
[]
>>> list(pat.finditer("xhello world!"))
[]
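The reason is that with (?<= and (?=, what you specify has to actually be present on the left and on the right. [^a-zA-Z] must match a character, so it fails when the greeting is at the start of the string or the punctuation is at the end.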
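To see this in isolation, here is a minimal sketch that uses only the lookahead half of the pattern, with the standard-library re module for simplicity (an assumption; the question uses the regex library, where the lookbehind half behaves the same way). The names strict and relaxed are illustrative only.

```python
import re

# [^a-zA-Z] must consume one character, so it cannot succeed at the end of the string.
strict = re.compile(r"world(?=\s*[!?.][^a-zA-Z])")
# Offering "$" as an alternative lets the lookahead succeed at the end of the string too.
relaxed = re.compile(r"world(?=\s*[!?.](?:$|[^a-zA-Z]))")

print(strict.search("hello world!"))    # None: nothing follows "!"
print(strict.search("hello world! "))   # match: a space follows "!"
print(relaxed.search("hello world!"))   # match: "$" succeeds at the end
print(relaxed.search("hello world!x"))  # None: a letter follows "!"
```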
Note that with [!?.]\b there is no word boundary after the punctuation char when it is not followed by a word character.
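A quick check of that word boundary behaviour, sketched with the standard-library re module (the name after_punct is made up for this example):

```python
import re

# \b only matches where a word character sits on exactly one side of the position.
after_punct = re.compile(r"[!?.]\b")

print(after_punct.search("world!"))    # None: "!" is followed by the end of the string
print(after_punct.search("world! x"))  # None: "!" is followed by a space
print(after_punct.search("world!x"))   # match: "!" is followed by the word character "x"
```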
You could write the pattern as:
(?<=\b(?:hello|hi)\s+)world(?=\s*[!?.](?!\S))
Explanation
(?<=    Positive lookbehind, assert that to the left is
  \b(?:hello|hi)\s+    Match either the word hello or hi and 1+ whitespace chars
)    Close lookbehind
world    Match literally
(?=    Positive lookahead, assert that to the right is
  \s*[!?.]    Match optional whitespace chars and one of ! ? .
  (?!\S)    Assert a whitespace boundary to the right
)    Close the lookahead
Or asserting a whitespace boundary to the left instead of the word boundary:
(?<=(?<!\S)(?:hello|hi)\s+)world(?=\s*[!?.](?!\S))
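The difference between the two left-hand boundaries shows up when a non-word, non-whitespace character precedes the greeting. The sketch below is not the pattern above verbatim: the standard-library re module rejects the variable-width lookbehind, so as a stand-in it consumes the greeting and captures "world" in group 1. The names tail, word_boundary and space_boundary are hypothetical.

```python
import re

# Assumption: stdlib re cannot compile the variable-width lookbehind, so the
# greeting is matched normally and "world" is captured in group 1 instead.
tail = r"\s+(world)(?=\s*[!?.](?!\S))"
word_boundary = re.compile(r"\b(?:hello|hi)" + tail)
space_boundary = re.compile(r"(?<!\S)(?:hello|hi)" + tail)

for text in ["hello world!", "-hello world!", "hello world!x"]:
    wb = word_boundary.search(text)
    sb = space_boundary.search(text)
    # Print the span of "world" for each variant, or None when there is no match.
    print(repr(text), wb.span(1) if wb else None, sb.span(1) if sb else None)
```

For "-hello world!" only the \b variant matches, because there is a word boundary between "-" and "h", while (?<!\S) requires whitespace or the start of the string before the greeting.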