Escaping regex string
Question:
I want to use input from a user as a regex pattern for a search over some text. It works, but how can I handle cases where the user enters characters that have a special meaning in regex?
For example, the user wants to search for Word (s): the regex engine will treat the (s) as a group. I want it to be treated as the literal string "(s)". I could run replace on the user input and replace ( with \( and ) with \), but then I would need to do that for every possible regex symbol.
Is there a better way?
Answers:
Use the re.escape() function for this:
escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
A simplistic example: search for any occurrence of the provided string, optionally followed by 's', and return the match object.
import re

def simplistic_plural(word, text):
    word_or_plural = re.escape(word) + 's?'
    # re.search() looks anywhere in text; re.match() only matches at the start
    return re.search(word_or_plural, text)
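As a quick check of the idea above, the following sketch runs the question's Word (s) input through re.escape() before searching. The helper name find_literal and the sample text are made up for illustration; re.search() is used so that the match can occur anywhere in the text.

```python
import re

def find_literal(user_input, text):
    """Search text for user_input taken literally (hypothetical helper)."""
    # re.escape() backslashes the regex metacharacters in user_input,
    # so "(s)" is matched as three literal characters, not as a group.
    return re.search(re.escape(user_input), text)

text = "Please look up Word (s) in the glossary."

# Unescaped, "(s)" is a capturing group, so the pattern means "Word s".
print(re.search("Word (s)", text))        # None: the text has no "Word s"

# Escaped, the parentheses are matched literally.
match = find_literal("Word (s)", text)
print(match.group(0))                     # Word (s)
```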
You can use re.escape():
re.escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
>>> import re
>>> re.escape('^a.*$')
'\\^a\\.\\*\\$'
On Python versions before 3.7, this also escapes non-alphanumerics that are not part of regular expression syntax. On versions from 3.3 up to (but not including) 3.7, the underscore (_) is the one exception and is left unescaped. From 3.7 on, only characters that can have a special meaning in a regular expression are escaped.
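Whichever Python version is in use, the property that matters stays the same: the escaped pattern matches the original string and nothing else. A small sketch (the sample strings are arbitrary, and re.fullmatch() assumes Python 3.4 or later) that checks this without depending on exactly which characters a given version escapes:

```python
import re

# Arbitrary sample inputs containing regex metacharacters.
samples = ["^a.*$", "Word (s)", "a_b", "1+1=2", "C:\\temp\\file.txt"]

for s in samples:
    pattern = re.escape(s)
    # The escaped pattern matches the original string in full.
    assert re.fullmatch(pattern, s) is not None

# Metacharacters lose their special meaning: "." no longer matches any character.
assert re.fullmatch(re.escape("a.c"), "abc") is None
print("all checks passed")
```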
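Unfortunately, re.escape() is not suited for the replacement string, because the backslashes it adds end up in the output. For example, on a Python version before 3.3, where the underscore is still escaped:
>>> re.sub('a', re.escape('_'), 'aa')
'\\_\\_'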
A solution is to put the replacement in a lambda:
>>> re.sub('a', lambda _: '_', 'aa')
'__'
because the return value of the lambda is treated by re.sub() as a literal string.
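A sketch of two ways to insert a user-supplied replacement literally. The helper names sub_literal and escape_replacement are hypothetical. The second helper relies on the backslash being the only character with special meaning in a replacement template, so doubling it is enough.

```python
import re

def sub_literal(pattern, replacement, text):
    """Replace matches with replacement taken verbatim (hypothetical helper)."""
    # A callable's return value is inserted as-is, with no escape processing.
    return re.sub(pattern, lambda _match: replacement, text)

def escape_replacement(replacement):
    """Double backslashes so a string replacement is treated literally."""
    return replacement.replace("\\", "\\\\")

# In a plain template, \1 would be a group reference and \n a newline.
user_text = r"\1 and \n"

print(sub_literal("a", user_text, "a-a"))                  # \1 and \n-\1 and \n
print(re.sub("a", escape_replacement(user_text), "a-a"))   # \1 and \n-\1 and \n
```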
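Please give this a try:
Use \Q and \E as anchors around the literal text (supported in flavors such as Perl, PCRE, and Java, but not by Python's re module).
Put an "or" condition in the pattern to match either a full word or a regex.
Ref link: How to match a whole word that includes special characters in regex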
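Python's re module has no \Q...\E quoting, so re.escape() fills that role there. A minimal sketch of the whole-word idea, assuming "whole word" means the input is not directly adjacent to another word character (the helper name find_whole is hypothetical). Lookarounds are used instead of \b because \b does not behave as expected when the input starts or ends with a non-word character such as a parenthesis:

```python
import re

def find_whole(user_input, text):
    """Find user_input literally, not embedded in a longer word (hypothetical)."""
    # (?<!\w) and (?!\w) assert that no word character touches the match.
    pattern = r"(?<!\w)" + re.escape(user_input) + r"(?!\w)"
    return re.search(pattern, text)

print(find_whole("cat", "concatenate"))                    # None: "cat" is inside a word
print(find_whole("cat", "a cat sat").group(0))             # cat
print(find_whole("Word (s)", "See Word (s).").group(0))    # Word (s)
```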
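Owen's answer can lead to inconsistencies. A lambda should just be an inline replacement for a function call, but it produces different results, as shown below. If somebody later has to 'upgrade' the lambda to a function call, for instance to build in some extra complexity, this would suddenly break. (The difference arises because rw(replacewith) hands the returned string to sub(), which treats a string as a template and expands \1, whereas the return value of a callable is inserted literally.)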
import re

xml = """pre@mytag@123@/mytag@post"""
# Raw string, so \1 is a backslash followed by "1" rather than the character \x01
replacewith = r'@mytag@456 \1@/mytag@'
regexp = re.compile(r'@mytag@(.*?)@/mytag@', re.S|re.M|re.I)

def rw(inp):
    return inp

result = regexp.sub(lambda _: replacewith, xml)
print(result) # desired result: pre@mytag@456 \1@/mytag@post
result = regexp.sub(rw(replacewith), xml)
print(result) # undesired result: pre@mytag@456 123@/mytag@post
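For comparison, a sketch of what "upgrading" the lambda to a named function might look like when the function itself, not its return value, is passed to sub(). The name rw_match is hypothetical; the pattern and strings follow the example above, with the replacement written as a raw string so that \1 is a backslash followed by 1.

```python
import re

xml = "pre@mytag@123@/mytag@post"
replacewith = r"@mytag@456 \1@/mytag@"
regexp = re.compile(r"@mytag@(.*?)@/mytag@", re.S | re.M | re.I)

def rw_match(match):
    """Replacement callable: receives the match object, returns literal text."""
    return replacewith

# Passing a callable (lambda or named function): the return value is inserted verbatim.
print(regexp.sub(lambda _: replacewith, xml))   # pre@mytag@456 \1@/mytag@post
print(regexp.sub(rw_match, xml))                # pre@mytag@456 \1@/mytag@post

# Passing a string: it is a template, so \1 expands to the first group ("123").
print(regexp.sub(replacewith, xml))             # pre@mytag@456 123@/mytag@post
```

Seen this way, the lambda and the named function agree as long as both are passed as callables; the behavior only changes when a string is passed instead.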