How do Python and the regex module handle backslashes?

Question:

My current understanding of the Python 3.4 regex library (the re module), based on the language reference, does not seem to match the results of my experiments with the module.


My current understanding

The regular expression engine can be thought of as a separate entity with its own language (regex) that it understands. It just happens to live inside Python, as it does inside a variety of other languages. As such, Python must pass the regex pattern/code to this independent interpreter, if you will.

For clarity, the following text uses the notion of logical length, which represents how long a given string logically is. For example, the special character carriage return \r has len=1, since it is a single character. However, the 2 distinct characters (a backslash followed by an r) \r have len=2.

Step 1) Let's say we want to match a carriage return \r (len=1) in some text.

Step 2) We need to feed the pattern \r (len=2, 2 distinct characters) to the regular expression engine.

Step 3) The regular expression engine receives \r (len=2) and interprets the pattern as: match the special character carriage return \r (len=1).

Step 4) It goes ahead and does the magic.

The problem is that the backslash character itself is treated by the Python interpreter as something special: a character meant to escape other stuff (like quotes).

So when we are coding in Python and need to express the idea that we want to send the pattern \r (len=2) to the internal regular expression interpreter, we must type pattern = '\\r' or alternatively pattern = r'\r' to express \r (len=2).
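
The logical lengths described above can be checked directly in Python. This is only a minimal sketch; the variable names are made up for illustration.

```python
# The same-looking text written three ways in Python source.
cr = '\r'          # one character: the carriage return itself
escaped = '\\r'    # two characters: a backslash, then the letter r
raw = r'\r'        # raw string: also a backslash, then the letter r

print(len(cr), len(escaped), len(raw))   # 1 2 2
print(escaped == raw)                    # True
```

The two len=2 spellings are the same string; only the len=1 version contains an actual carriage return.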


And everything is well… until

I try a couple of experiments involving re.escape



Summary of questions

Point 1) Please confirm/modify my current understanding of the regex engine.

Point 2) Why do these supposedly non-textbook patterns match?

Point 3) What on earth is going on with '\\\r' from re.escape, and the whole "we have the same string lengths, but we compared unequal, but we ALSO all worked the same in matching a carriage return in the previous re.search test"?

Asked By: AlanSTACK

||

Answers:

You need to understand that each time you write a pattern, it is first interpreted by Python as a string literal, and only then read and interpreted a second time by the regex engine.
Let's describe what happens:

>>> s='\r'

s contains the character CR.

>>> re.match('\r', s)
<_sre.SRE_Match object; span=(0, 1), match='\r'>

Here the string '\r' is a string that contains CR, so a literal CR is given to the regex engine.

>>> re.match('\\r', s)
<_sre.SRE_Match object; span=(0, 1), match='\r'>

The string is now a literal backslash and a literal r. The regex engine receives these two characters, and since \r is a regex escape sequence that also means a CR character, you obtain a match too.

>>> re.match('\\\r', s)
<_sre.SRE_Match object; span=(0, 1), match='\r'>

The string contains a literal backslash and a literal CR. The regex engine receives \ and CR, but since \CR isn't a known regex escape sequence, the backslash is ignored and you obtain a match.
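
The three cases can be run side by side. This is a minimal sketch rather than the experiment from the question (the screenshots are not available); the variable names are illustrative, and the comparison with re.escape assumes it escapes CR the way the re versions I know of do.

```python
import re

s = '\r'                  # the text to search: a single CR

literal_cr = '\r'         # case 1: the pattern is the CR character itself
regex_escape = '\\r'      # case 2: backslash + letter r, the regex escape for CR
backslash_cr = '\\\r'     # case 3: backslash + the CR character

for pattern in (literal_cr, regex_escape, backslash_cr):
    # Show each pattern, its length, and whether it matches the CR.
    print(repr(pattern), len(pattern), re.match(pattern, s) is not None)

# re.escape('\r') yields the case-3 pattern: same length as case 2, different content.
print(re.escape('\r') == backslash_cr)    # True
print(backslash_cr == regex_escape)       # False
```

That would explain two strings of equal length comparing unequal while both still match a carriage return.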

Note that for the regex engine, a literal backslash is the escape sequence \\ (so in a pattern string, r'\\' or '\\\\').
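
As a minimal sketch of that last point (the sample text and names are made up for illustration), matching one literal backslash looks like this:

```python
import re

text = 'a\\b'    # three characters: a, backslash, b

# Both spellings hand the regex engine the same two-character pattern,
# which it reads as "one literal backslash".
pattern = r'\\'
print(pattern == '\\\\')                    # True
print(re.search(pattern, text).span())      # (1, 2)

# A single backslash on its own is an incomplete escape for the engine.
try:
    re.compile('\\')
    lone_backslash_ok = True
except re.error:
    lone_backslash_ok = False
print(lone_backslash_ok)                    # False
```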
