re.sub(".*", "(replacement)", "text") doubles replacement on Python 3.7

Question:

On Python 3.7 (tested on 64-bit Windows), replacing a string using the regex .* produces the replacement string twice!

On Python 3.7.2:

>>> import re
>>> re.sub(".*", "(replacement)", "sample text")
'(replacement)(replacement)'

On Python 3.6.4:

>>> import re
>>> re.sub(".*", "(replacement)", "sample text")
'(replacement)'

On Python 2.7.5 (32 bits):

>>> import re
>>> re.sub(".*", "(replacement)", "sample text")
'(replacement)'

What is wrong? How can I fix it?

Asked By: Laurent LAPORTE


Answers:

This is not a bug, but the result of a bug fix in Python 3.7, introduced by commit fbb490fd2f38bd817d99c20c05121ad0168a38ee.

In regex, a non-zero-width match moves the pointer to the end of the match, so that the next match attempt, zero-width or not, starts from the position following the match. In your example, .* greedily matches and consumes the entire string, which leaves the pointer at the end of the string. There is still "room" for a zero-width match at that position, as the following code shows; it behaves the same in Python 2.7, 3.6 and 3.7:

>>> re.findall(".*", 'sample text')
['sample text', '']

The bug fix concerns the replacement of a zero-width match that immediately follows a non-zero-width match. Before 3.7, re.sub skipped that trailing empty match; it now correctly replaces both matches with the replacement text.
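To make the two matches visible, here is a minimal sketch that lists each match of .* together with its span, using re.finditer. It assumes Python 3.7 or later for the re.sub result; the variable names are only illustrative.

```python
import re

text = "sample text"

# Collect every match of ".*" with its (start, end) span.
# The first match covers the whole string; the second is the
# empty match left over at the end of the string.
matches = [(m.span(), m.group()) for m in re.finditer(".*", text)]
print(matches)

# On Python 3.7+, re.sub replaces both of those matches.
result = re.sub(".*", "(replacement)", text)
print(result)
```

The second span starts and ends at the same index, which is what makes it a zero-width match.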

Answered By: blhsing

This is a common regex issue that affects many regex flavors.

There are several ways to fix the issue:

  • Add anchors on both sides of .*: re.sub("^.*$", "(replacement)", "sample text")
  • If you only want to replace the first match, add the count=1 argument: re.sub(".*", "(replacement)", "sample text", count=1)
  • If you want to replace any non-empty line, replace * with +: re.sub(".+", "(replacement)", "sample text")

See the Python demo:

import re
# Adding anchors:
print(re.sub("^.*$", "(replacement)", "sample text"))  # => (replacement)
# Using the count=1 argument:
print(re.sub(".*", "(replacement)", "sample text", count=1))  # => (replacement)
# If you want to replace non-empty lines:
print(re.sub(".+", "(replacement)", "sample text"))  # => (replacement)
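The three fixes agree on "sample text" but not on every input, so it may help to compare them on an empty string and on a two-line string. This is only a sketch: the helper names with_anchors, with_count and with_plus are hypothetical, introduced here for illustration, and no re.MULTILINE flag is used.

```python
import re

REPLACEMENT = "(replacement)"

# Hypothetical helpers, one per fix listed above.
def with_anchors(text):
    # Matches only if the whole string is a single line.
    return re.sub("^.*$", REPLACEMENT, text)

def with_count(text):
    # Replaces the first match only, even if it is empty.
    return re.sub(".*", REPLACEMENT, text, count=1)

def with_plus(text):
    # Replaces every non-empty run of non-newline characters.
    return re.sub(".+", REPLACEMENT, text)

for sample in ["sample text", "", "a\nb"]:
    print(repr(sample))
    print("  anchors:", repr(with_anchors(sample)))
    print("  count=1:", repr(with_count(sample)))
    print("  plus:   ", repr(with_plus(sample)))
```

On the empty string, the anchored and count=1 versions still insert the replacement, while .+ leaves the string empty. On "a\nb", the anchored version changes nothing, count=1 replaces only the first line, and .+ replaces both lines. Which one fits depends on whether empty or multi-line input should be replaced.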
Answered By: Wiktor Stribiżew