Matching any character including newlines in a Python regex subexpression, not globally
Question:
I want to use re.MULTILINE but NOT re.DOTALL, so that I can have a regex that includes both an “any character” wildcard and the normal . wildcard that doesn’t match newlines.
Is there a way to do this? What should I use to match any character in those instances where I want to include newlines?
Answers:
To match a newline, or "any symbol", without re.S/re.DOTALL, you may use any of the following:
- (?s:.) – the inline modifier group with the s flag on sets a scope where all . patterns match any char, including line break chars (scoped inline flags require Python 3.6 or later)
- Any of the following workarounds:
  [\s\S]
  [\w\W]
  [\d\D]
The main idea is that the opposite shorthand classes inside a character class match any symbol there is in the input string.
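As a quick check of that idea, the sketch below tests each form against a single newline and a single ordinary character. The names ANY_CHAR_FORMS and matches_any_char are made up for this illustration, and the classes are assumed to be written with their backslashes ([\s\S], [\w\W], [\d\D]).

```python
import re

# Hypothetical list for this illustration: the scoped flag plus the three
# complementary character classes from the answer.
ANY_CHAR_FORMS = [r"(?s:.)", r"[\s\S]", r"[\w\W]", r"[\d\D]"]


def matches_any_char(form: str, ch: str) -> bool:
    """Return True if the one-character pattern `form` matches `ch` exactly."""
    return re.fullmatch(form, ch) is not None


for form in ANY_CHAR_FORMS:
    # Each form should accept both a newline and an ordinary character.
    print(form, matches_any_char(form, "\n"), matches_any_char(form, "x"))

# A plain dot, with no DOTALL flag in effect, rejects the newline.
print(".", matches_any_char(".", "\n"))
```

No flags are passed to re here, so the plain dot keeps its default behaviour while the other forms still accept the newline.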
Compared with (.|\s) and other variations that use alternation, the character class solution is much more efficient, as it involves much less backtracking (when used with a * or + quantifier). Compare a small example: (?:.|\n)+ takes 45 steps to complete, while [\s\S]+ takes just 2 steps.
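Python's re module does not report step counts, so the figures above cannot be reproduced directly. As a rough proxy, this sketch times both patterns with timeit on a made-up sample string; the absolute numbers depend on the machine and Python version, so treat them as indicative only.

```python
import re
import timeit

# Made-up sample input: 200 short lines, each ending in a newline.
text = "line of text\n" * 200

alternation = re.compile(r"(?:.|\n)+")
char_class = re.compile(r"[\s\S]+")

# Both patterns consume the whole string, so they are interchangeable here.
assert alternation.match(text).group() == text
assert char_class.match(text).group() == text

# Time repeated matching; only the relative difference is of interest.
for name, pattern in (("alternation", alternation), ("char class", char_class)):
    seconds = timeit.timeit(lambda: pattern.match(text), number=200)
    print(f"{name}: {seconds:.4f}s")
```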
See a Python demo where I am matching a line starting with 123 and up to the first occurrence of 3 at the start of a line, including the rest of that line:
import re
text = """abc
123
def
356
more text..."""
print(re.findall(r"^123(?s:.*?)^3.*", text, re.M))
# => ['123\ndef\n356']
print(re.findall(r"^123[\w\W]*?^3.*", text, re.M))
# => ['123\ndef\n356']
Match any character (including newline):
Regular expression (note that a space ' ' is also included):
[\S\n\t\v ]
Example:
import re
text = 'abc def ###A quick brown fox.\nIt jumps over the lazy dog### ghi jkl'
# We want to extract "A quick brown fox.\nIt jumps over the lazy dog"
# The capturing group keeps the ### delimiters out of the result.
matches = re.findall(r'###([\S\n ]+)###', text)
print(matches[0])
matches[0] will contain:
'A quick brown fox.\nIt jumps over the lazy dog'
Description of \S from the Python docs:
\S
Matches any character which is not a whitespace character.
(See: https://docs.python.org/3/library/re.html#regular-expression-syntax)
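One caveat with listing whitespace characters by hand: assuming the class above is meant as [\S\n\t\v ], it leaves out a few whitespace characters such as carriage return and form feed, whereas [\s\S] covers them. The sketch below compares the two; the variable names explicit and catch_all are chosen only for this illustration.

```python
import re

# The hand-written class from the answer (assumed spelling, with backslashes)
# next to the complementary-pair class that matches every character.
explicit = re.compile(r"[\S\n\t\v ]")
catch_all = re.compile(r"[\s\S]")

for ch in ("a", " ", "\n", "\t", "\r", "\f"):
    # Print whether each class accepts the single character `ch`.
    print(repr(ch), bool(explicit.fullmatch(ch)), bool(catch_all.fullmatch(ch)))
```

If the input may contain Windows line endings (\r\n), the explicit class would need \r added, or [\s\S] could be used instead.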