Inconsistency between $ and ^ in regex when using start/end arguments to re.search?

Question:

From what I’ve read, ^ should match the start of a string, and $ the end. However, with re.search(), it looks like the behavior of ^ continues to work fine, while $ ‘breaks’. Example:

>>> import re
>>> a = re.compile( "^a" )
>>> print( a.search( "cat", 1, 3 ) )
None

This seems correct to me — 'a' is not at the start of the string, even if it is at the start of the search.

>>> a = re.compile( "a$" )
>>> print( a.search( "cat", 0, 2 ) )
<_sre.SRE_Match object at 0x7f41df2334a8>

This seems wrong to me, or inconsistent at least.

The documentation on the re module explicitly mentions that the behavior of ^ does not change due to start/end arguments to re.search, but no change in behavior is mentioned for $ (that I’ve seen).

Can anyone explain why things were designed this way, and/or suggest a convenient workaround?

By workaround, I would like to compose a regex which always matches the end of the string, even when someone uses the end argument to re.search.

And why was re.search designed such that:

s.search( string, endpos=len(string) - 1 )

is the same as

s.search( string[:-1] )

when

s.search( string, pos=1 )

is explicitly and intentionally not the same as

s.search( string[1:] )

It seems to be less an issue of inconsistency between ^ and $, and more of an inconsistency within the re.search function.

Asked By: bgutt3r


Answers:

According to the search() documentation here:

The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, so only the characters from pos to endpos - 1 will be searched for a match.

So your syntax, a.search("cat", 0, 2) is equivalent to a.search("ca"), which does match the pattern a$.
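
A quick way to check this equivalence yourself is the sketch below. The variable names (pattern, limited, sliced, full) are only illustrative and are not part of the original question:

```python
import re

pattern = re.compile(r"a$")

# With endpos=2 the search only sees "ca", so "a" sits at the effective end.
limited = pattern.search("cat", 0, 2)

# Slicing the string to the same length gives the same result.
sliced = pattern.search("cat"[:2])

print(limited.span(), sliced.span())

# Without endpos the whole string "cat" ends in "t", so "a$" cannot match.
full = pattern.search("cat")
print(full)
```

Both limited and sliced report a match on the "a" at index 1, while the unrestricted search finds nothing.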

Answered By: gpanders

This seems wrong to me, or inconsistent at least.

No, the endpos interpretation is consistent with the rest of Python; it’s the pos (start) parameter that’s inconsistent, as the documentation explains:

parameter pos gives an index in the string where the search is to
start; it defaults to 0. This is not completely equivalent to slicing
the string; the ‘^’ pattern character matches at the real beginning of
the string
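
The difference between pos and slicing can be seen in a short sketch like the following. The names caret, with_pos, with_slice and match_at_pos are illustrative only:

```python
import re

caret = re.compile(r"^a")

# pos=1 starts scanning at index 1, but ^ still anchors to the real
# beginning of "cat", so nothing matches.
with_pos = caret.search("cat", 1)

# Slicing builds a new string "at" whose real beginning is the "a".
with_slice = caret.search("cat"[1:])

# If the goal is "match exactly at pos", the pattern object's match()
# method anchors at pos without needing ^ at all.
match_at_pos = re.compile(r"a").match("cat", 1)

print(with_pos, with_slice.span(), match_at_pos.span())
```

So with_pos is None, while the sliced search and the match() call both succeed.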

Answered By: cdlane

Short Answer

Use \A and \Z to match the literal beginning or end of a string.
The relevant lines from the re module’s docs:

6.2.1. Regular Expression Syntax

\A
Matches only at the start of the string.

\Z
Matches only at the end of the string.

Caveat about endpos

This won’t work “even when someone uses the end argument to re.search”.
Unlike the “start” parameter pos, which just marks a starting point, the endpos parameter means the search (or match) will be conducted on only a portion of the string (emphasis added):

6.2.3. Regular Expression Objects

regex.search(string[, pos[, endpos]])

The optional parameter endpos limits how far the string will be searched;
it will be as if the string is endpos characters long,
[…]
rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).

\Z matches the end of the string being searched, which is exactly what endpos changes.
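
Since no anchor can see past endpos, one possible workaround is to check the match position after the fact. The sketch below is only an assumption about what "always match the real end" could mean: search_to_true_end is a hypothetical helper, not part of the re module, and it simply rejects any match that does not finish at the real end of the string.

```python
import re

def search_to_true_end(pattern, string, pos=0, endpos=None):
    """Hypothetical helper: like pattern.search(), but return None unless
    the match finishes at the real end of `string`."""
    if endpos is None:
        endpos = len(string)
    match = pattern.search(string, pos, endpos)
    # match.end() is an index into the original string, so it can be
    # compared with len(string) even when endpos truncated the search.
    if match is not None and match.end() == len(string):
        return match
    return None

end_a = re.compile(r"a\Z")
end_t = re.compile(r"t\Z")

# \Z is satisfied at the end of the truncated view "ca".
truncated = end_a.search("cat", 0, 2)

# The helper rejects that match because it stops before the real end.
strict_short = search_to_true_end(end_a, "cat", 0, 2)

# With no endpos, a match that reaches the real end is returned unchanged.
strict_full = search_to_true_end(end_t, "cat")

print(truncated.span(), strict_short, strict_full.span())
```

This moves the end-of-string check out of the regex and into the caller, which is the only place that still knows the full length of the string.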

Background

The more-familiar ^ and $ don’t do what you think they do:

^
(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.

$
Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline.
foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’.
More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode;
searching for a single $ in 'foo\n' will find two (empty) matches:
one just before the newline, and one at the end of the string.
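
The documented examples can be reproduced with a short sketch such as the one below. The variable names are illustrative, and the last line adds \Z for comparison, which the quoted passage does not cover:

```python
import re

text = "foo1\nfoo2\n"

# Default mode: $ matches at the end, or just before a trailing newline.
default_match = re.search(r"foo.$", text).group()

# MULTILINE mode: $ also matches before every newline, so the first line wins.
multiline_match = re.search(r"foo.$", text, re.MULTILINE).group()

# A bare $ matches twice in "foo\n": before the newline and at the very end.
dollar_positions = [m.start() for m in re.finditer(r"$", "foo\n")]

# \Z matches only at the very end of the string.
strict_positions = [m.start() for m in re.finditer(r"\Z", "foo\n")]

print(default_match, multiline_match, dollar_positions, strict_positions)
```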

Python’s regular expressions are heavily influenced by Perl’s, which extended the old grep abilities with a host of its own.
That included multi-line matching, which raised a question about metacharacters like ^:
Was it matching the beginning of the string, or the beginning of the line?
When grep was only matching one line at a time, those were equivalent concepts.

As you can see, ^ and $ ended up trying to match everything “start-like” and “end-ish”.
Perl introduced the new escape sequences \A and \z (lower-case) to match only the start-of-string and end-of-string.

Those escape sequences were adopted by Python, but with one difference:
Python did not adopt Perl’s \Z (upper-case), which matched both end-of-string and the special case newline-before-end-of-string…
making it not quite the partner to \A that one would expect.

(I assume Python upper-cased Perl’s \z for consistency, avoiding the lopsided '\Apattern\z' regexes that were recommended in books like Perl Best Practices.)

History of pos and endpos

It appears that the strange “not actually the start-start position” meaning of pos is as old as the parameter itself:

  • The Python 1.4 match function docs (25 Oct 1996 — probably pre-dating the regex object) don’t show the pos or endpos parameters at all.

  • The Python 1.5 match method docs (17 Feb 1998) introduce both the regular expression object and the pos and endpos parameters.
    It states that a ^ will match at pos, although later revisions suggest this was a typo.
    (Speaking of typos:
    The ^ character itself is missing.
    It came and went, until finally reappearing for good(?) in Python 2.1.)

  • The Python 1.5.1 match method docs (14 Apr 1998) insert the missing “not”, reversing the previous docs.

  • The Python 1.5.1p1 match method docs (06 Aug 1998) clarify the unexpected effects of pos.
    They match Python 3.6.1’s description of pos word-for-word…
    give or take that pesky ^ typo.

I suspect the numerous changes to the docs over a couple months of bug-fix releases reflect the docs catching up with reality — not changes to the design of match
(although I don’t have Python 1 lying around to verify that).

The python-dev mailing list archives only go back to 1999, so unless the earlier messages were saved somewhere else, I think answering the “why” question would require guessing who wrote that code, and asking them.

Answered By: Kevin J. Chase