Inconsistency between $ and ^ in regex when using start/end arguments to re.search?
Question:
From what I’ve read, ^ should match the start of a string, and $ the end. However, with re.search(), it looks like the behavior of ^ continues to work fine, while $ ‘breaks’. Example:
>>> import re
>>> a = re.compile("^a")
>>> print(a.search("cat", 1, 3))
None
This seems correct to me: 'a' is not at the start of the string, even if it is at the start of the search.
>>> a = re.compile("a$")
>>> print(a.search("cat", 0, 2))
<_sre.SRE_Match object at 0x7f41df2334a8>
This seems wrong to me, or inconsistent at least.
The documentation on the re module explicitly mentions that the behavior of ^ does not change due to start/end arguments to re.search, but no change in behavior is mentioned for $ (that I’ve seen).
Can anyone explain why things were designed this way, and/or suggest a convenient workaround?
By workaround, I mean a regex which always matches the end of the string, even when someone uses the end argument to re.search.
And why was re.search designed such that:
s.search(string, endpos=len(string) - 1)
is the same as
s.search(string[:-1])
when
s.search(string, pos=1)
is explicitly and intentionally not the same as
s.search(string[1:])
It seems to be less an issue of inconsistency between ^ and $, and more of an inconsistency within the re.search function.
Answers:
According to the search() documentation:
The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, so only the characters from pos to endpos - 1 will be searched for a match.
So your call, a.search("cat", 0, 2), is equivalent to a.search("ca"), which does match the pattern a$.
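A quick way to check that equivalence yourself is the sketch below. It reuses the pattern and string from the question; the variable names are my own.

```python
import re

pattern = re.compile(r"a$")

# With endpos=2 the search behaves as if the string were just "ca",
# so "$" can match immediately after the "a".
with_endpos = pattern.search("cat", 0, 2)

# Slicing the string first gives the same result.
sliced = pattern.search("cat"[:2])

print(with_endpos.span())  # (1, 2)
print(sliced.span())       # (1, 2)

# Without endpos the "a" is followed by "t", so nothing matches.
no_limit = pattern.search("cat")
print(no_limit)            # None
```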
This seems wrong to me, or inconsistent at least.
No, the endpos interpretation is consistent with the rest of Python; it’s the starting position, pos, that’s inconsistent, as the documentation explains:
parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the ‘^’ pattern character matches at the real beginning of the string
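The difference between passing pos and slicing can be seen with a small sketch like this one (same pattern and string as in the question, variable names are mine):

```python
import re

pattern = re.compile(r"^a")

# pos=1 only moves the starting point of the scan; "^" still refers to
# index 0 of the full string "cat", so the "a" at index 1 does not qualify.
from_pos = pattern.search("cat", 1)

# Slicing builds a new string "at", whose real beginning is the "a".
from_slice = pattern.search("cat"[1:])

print(from_pos)           # None
print(from_slice.span())  # (0, 1)
```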
Short Answer
Use \A and \Z to match the literal beginning or end of a string.
The relevant lines from the re module’s docs:
6.2.1. Regular Expression Syntax
\A
Matches only at the start of the string.
\Z
Matches only at the end of the string.
Caveat about endpos
This won’t work “even when someone uses the end argument to re.search”.
Unlike the “start” parameter pos, which just marks a starting point, the endpos parameter means the search (or match) will be conducted on only a portion of the string:
6.2.3. Regular Expression Objects
regex.search(string[, pos[, endpos]])
The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, […] rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).
The \Z matches the end of the string being searched, which is exactly what endpos changes.
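Since no pattern can see past endpos, one possible workaround is to check the match object afterwards instead. The sketch below is only an illustration: ends_at_real_end is a hypothetical helper of mine, not something from the re module, and it relies on the match object's string attribute holding the full string that was passed in.

```python
import re


def ends_at_real_end(match):
    """Hypothetical helper: True only if the match ends at the end of the
    full original string, regardless of any endpos that was used."""
    return match is not None and match.end() == len(match.string)


pattern = re.compile(r"a\Z")

# \Z is satisfied at endpos, so this still matches the "a" in "cat".
truncated = pattern.search("cat", 0, 2)
print(truncated.span())              # (1, 2)
print(ends_at_real_end(truncated))   # False

# A match that really reaches the end of the string passes the check.
full = re.compile(r"t\Z").search("cat")
print(ends_at_real_end(full))        # True
```

This keeps the regex itself unchanged and moves the "real end" requirement into the calling code.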
Background
The more-familiar ^ and $ don’t do what you think they do:
^
(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
$
Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
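The documented behavior of $ is easy to reproduce. This sketch runs the docs' own examples, with the newlines written out explicitly as \n (variable names are mine):

```python
import re

text = "foo1\nfoo2\n"

# Default mode: "$" matches only at the end or before the final newline.
normal = re.search(r"foo.$", text)
# MULTILINE mode: "$" also matches before every newline.
multiline = re.search(r"foo.$", text, re.MULTILINE)

print(normal.group())     # foo2
print(multiline.group())  # foo1

# A lone "$" finds two empty matches in "foo\n":
# one just before the newline and one at the very end.
positions = [m.start() for m in re.finditer(r"$", "foo\n")]
print(positions)          # [3, 4]
```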
Python’s regular expressions are heavily influenced by Perl’s, which extended the old grep abilities with a host of its own. That included multi-line matching, which raised a question about metacharacters like ^: Was it matching the beginning of the string, or the beginning of the line? When grep was only matching one line at a time, those were equivalent concepts.
As you can see, ^ and $ ended up trying to match everything “start-like” and “end-ish”. Perl introduced the new escape sequences \A and \z (lower-case) to match only the start-of-string and end-of-string.
Those escape sequences were adopted by Python, but with one difference: Python did not adopt Perl’s \Z (upper-case), which matched both end-of-string and the special case newline-before-end-of-string… making it not quite the partner to \A that one would expect. (I assume Python upper-cased Perl’s \z for consistency, avoiding the lopsided '\Apattern\z' regexes that were recommended in books like Perl Best Practices.)
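On the Python side, the practical upshot is that \Z is stricter than $ when the string ends in a newline. A minimal sketch, using a sample string of my own:

```python
import re

text = "foo\n"

# "$" tolerates a single trailing newline.
dollar = re.search(r"foo$", text)
# Python's "\Z" does not: it only matches at the true end of the string.
strict = re.search(r"foo\Z", text)
# Including the newline in the pattern lets "\Z" match again.
with_newline = re.search(r"foo\n\Z", text)

print(dollar.span())        # (0, 3)
print(strict)               # None
print(with_newline.span())  # (0, 4)
```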
History of pos and endpos
It appears that the strange “not actually the start-start position” meaning of pos is as old as the parameter itself:
- The Python 1.4 match function docs (25 Oct 1996, probably pre-dating the regex object) don’t show the pos or endpos parameters at all.
- The Python 1.5 match method docs (17 Feb 1998) introduce both the regular expression object and the pos and endpos parameters. They state that a ^ will match at pos, although later revisions suggest this was a typo. (Speaking of typos: The ^ character itself is missing. It came and went, until finally reappearing for good(?) in Python 2.1.)
- The Python 1.5.1 match method docs (14 Apr 1998) insert the missing “not”, reversing the previous docs.
- The Python 1.5.1p1 match method docs (06 Aug 1998) clarify the unexpected effects of pos. They match Python 3.6.1’s description of pos word-for-word… give or take that pesky ^ typo.
I suspect the numerous changes to the docs over a couple months of bug-fix releases reflect the docs catching up with reality, not changes to the design of match (although I don’t have Python 1 lying around to verify that).
The python-dev mailing list archives only go back to 1999, so unless the earlier messages were saved somewhere else, I think answering the “why” question would require guessing who wrote that code, and asking them.