How to search for a character string that is an escape sequence with re.search

Question:

I wrote code to check whether the escape sequence "\n" is included in a string. However, it behaved unexpectedly, and I would like to know why. Why did I get the result in Case 2?

Case 1

The code below worked. Since r"\n" (reg1) is a string consisting of two characters, '\' and 'n', I think it is correct that it searches for and matches the target string "\n".

import re
reg1 = r"\n"
print(re.search(reg1, "\n"))
# output: <re.Match object; span=(0, 1), match='\n'>

Case 2

I expected the code below to print None, but it didn't. Since "\n" (reg2), the line feed escape sequence, was used as the pattern, and "\n", consisting of two characters, '\' and 'n', was used as the target string, I assumed they would not match. However, they actually matched.

import re
reg2 = "\n"
print(re.search(reg2, "\n"))
# output: <re.Match object; span=(0, 1), match='\n'>
Asked By: keima

||

Answers:

You are correct when it comes to the contents of the strings used for the regexes, but not the targets. The statement:

"\n" consisting of two characters, '\' and 'n', was used as the target string,

is incorrect. The interpretation of a string is not context-sensitive; r"\n" is always 2 characters, and "\n" is always 1. This is covered in the Python Regular Expression HOWTO:

r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline.
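As a quick check of the HOWTO's statement, here is a minimal sketch (the variable names raw and cooked are only for illustration) showing the lengths of the two literals and that both work as patterns against the one-character target "\n":

```python
import re

raw = r"\n"    # raw literal: a backslash followed by 'n'
cooked = "\n"  # ordinary literal: Python turns the escape into a newline

print(len(raw), len(cooked))  # 2 1

# The target "\n" is a single newline in both calls.
print(re.search(raw, "\n") is not None)     # True: re decodes \n itself
print(re.search(cooked, "\n") is not None)  # True: the pattern is a literal newline
```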

This is more easily demonstrated with a non-control character, as a literal "\n" would be written:

"
"

Did you catch that? Let’s use "þ" (thorn) instead.

Case 1:

re.search(r"\u00FE", "\u00FE")

r"\u00FE" is a string with 6 characters, which compiles to the regex /\u00FE/. The regex library itself interprets this as an escape sequence that matches a thorn character.

"\u00FE" is interpreted by Python, producing the string "þ".

/\u00FE/ matches "þ".

Case 2:

re.search("\u00FE", "\u00FE")

"\u00FE" is a string with 1 character, "þ", which compiles to the regex /þ/.

/þ/ matches "þ".

Result: both regexes match. The only difference is that the regex contains an escape sequence in case 1 and a character literal in case 2.
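The two cases can be reproduced with a short sketch like the following (the names target, m1 and m2 are illustrative, not part of the question):

```python
import re

target = "\u00FE"  # Python decodes the escape: one character, "þ"

m1 = re.search(r"\u00FE", target)  # 6-character pattern; re decodes \u00FE itself
m2 = re.search("\u00FE", target)   # 1-character pattern; a literal "þ"

print(len(r"\u00FE"), len("\u00FE"))  # 6 1
print(m1.group() == m2.group())       # True: both matched the thorn
```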

What you seem to have in mind is a raw string for the target:

re.search(r"\u00FE", r"\u00FE")
re.search("\u00FE", r"\u00FE")

Neither of these matches, as neither of the targets contains a thorn character.
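A small sketch of why a raw-string target does not match, assuming the target is the six-character string backslash, u, 0, 0, F, E (raw_target is an illustrative name):

```python
import re

raw_target = r"\u00FE"  # six characters, none of which is a thorn

print(len(raw_target))                   # 6
print(re.search(r"\u00FE", raw_target))  # None: the regex looks for "þ"
print(re.search("\u00FE", raw_target))   # None: same, via a literal "þ"
```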

If you wanted to match an escape sequence, the escape character must be escaped within the regex:

re.search(r"\\u00FE", r"\u00FE")
re.search("\\\\u00FE", r"\u00FE")

Either of those patterns will result in the regex /\\u00FE/, which matches a string containing the given escape sequence.
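To see that a doubled backslash in the regex matches the literal escape sequence, here is a minimal sketch (p1, p2 and raw_target are illustrative names). The re.escape call at the end is an addition not mentioned in the answer; it is the standard library helper for building such a pattern from a literal string.

```python
import re

raw_target = r"\u00FE"  # the literal text: backslash, u, 0, 0, F, E

p1 = r"\\u00FE"   # raw literal: the regex sees \\u00FE
p2 = "\\\\u00FE"  # ordinary literal: Python collapses each \\ to \ first

print(p1 == p2)                                 # True: same 7-character pattern
print(re.search(p1, raw_target) is not None)    # True

# re.escape backslash-escapes special characters for you.
p3 = re.escape(raw_target)
print(re.search(p3, raw_target) is not None)    # True
```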

Answered By: outis