Regex: Provide match for beginning of a sentence ignoring new lines
Question:
import re
string = "This is a sentence. Micky Mouse"
name = re.compile(r"\.?Micky Mouse")
name_match = name.search(string)
print(name_match)
I want to ensure that a match is only provided if "Micky Mouse" is at the beginning of a new sentence, i.e., only if it follows a dot ".".
However, there should also be a match irrespective of any newlines or spaces between "Micky Mouse" and the end of the previous sentence. So the following string should also provide a match: "This is a sentence. \nMicky Mouse"
Answers:
The \s character class matches any whitespace character, including \n.
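Something like the following should do the trick (\s* rather than \s?, so that a run of several whitespace characters, such as a space followed by a newline, is also accepted):
re.compile(r"\.\s*Micky Mouse")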
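As a rough check of the point above, here is a minimal sketch comparing \s? (at most one whitespace character) with \s* (any number). The sample strings are made up for illustration and are not from the question.

```python
import re

# \s matches any whitespace character: space, tab, newline, ...
one_ws = re.compile(r"\.\s?Micky Mouse")   # at most one whitespace character
any_ws = re.compile(r"\.\s*Micky Mouse")   # any number of whitespace characters

samples = [
    "This is a sentence. Micky Mouse",       # one space after the dot
    "This is a sentence.\nMicky Mouse",      # one newline after the dot
    "This is a sentence. \nMicky Mouse",     # space plus newline after the dot
    "This is a sentence about Micky Mouse",  # no dot before the name
]

# For each sample: (does one_ws match?, does any_ws match?)
results = [(bool(one_ws.search(s)), bool(any_ws.search(s))) for s in samples]
for s, r in zip(samples, results):
    print(repr(s), r)
```

Only the third sample separates the two patterns: a space followed by a newline is two whitespace characters, which \s? cannot absorb.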
In order to be at the beginning of a sentence, and ignore any whitespace differences after it, prepend the match target with (?:^|\.)\s*
(?:...) -> a non-capturing group
^|\. -> either the beginning of the string (^) or (|) a literal dot (\.)
\s* -> any amount of whitespace, including newlines, spaces, tabs, etc.
import re
string= """This is a sentence. Micky Mouse.
Micky Mouse again. No Micky Mouse match here."""
pattern = re.compile(r"(?:^|\.)\s*Micky Mouse")
name_match = re.finditer(pattern, string)
print([match.group(0) for match in name_match])
output:
['. Micky Mouse', '.\nMicky Mouse']
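The matches above include the leading dot and whitespace. If only the name itself is wanted, one option (an addition here, not part of the original answer) is to wrap the name in a capturing group and read group(1). The variable names below are chosen for illustration.

```python
import re

text = """This is a sentence. Micky Mouse.
Micky Mouse again. No Micky Mouse match here."""

# Same prefix as in the answer, but the name sits in a capturing group so it
# can be extracted without the leading dot and whitespace.
pattern = re.compile(r"(?:^|\.)\s*(Micky Mouse)")

names = [m.group(1) for m in pattern.finditer(text)]
starts = [m.start(1) for m in pattern.finditer(text)]
print(names)   # the captured names only
print(starts)  # where each captured name begins in text

# The ^ branch covers a name at the very start of the string.
at_start = pattern.search("Micky Mouse opens the show.")
print(at_start.group(1) if at_start else None)
```

Note that without re.MULTILINE, ^ only matches at the start of the whole string, so a name at the start of a later line still needs a preceding dot to match.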
You can match optional whitespace chars after the dot:
\.\s*Micky Mouse\b
The pattern matches:
\.\s* -> match a dot and optional whitespace chars (which can also match a newline)
Micky Mouse\b -> match the name literally, followed by a word boundary
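To see what the trailing \b buys, here is a small sketch with invented test strings: the word boundary rejects a longer word that merely starts with "Mouse", while whitespace or punctuation after the name still counts as a boundary.

```python
import re

pattern = re.compile(r"\.\s*Micky Mouse\b")

# Name followed by a space: boundary between "e" and " ".
followed_by_space = bool(pattern.search("The end.\n\nMicky Mouse arrived"))
# "Mouse" runs on into more letters: no boundary after "Mouse".
longer_word = bool(pattern.search("The end. Micky Mouseketeer"))
# Name followed by punctuation: boundary between "e" and ".".
followed_by_dot = bool(pattern.search("The end. Micky Mouse."))

print(followed_by_space, longer_word, followed_by_dot)
```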