re.sub(".*", "(replacement)", "text") doubles replacement on Python 3.7
Question:
On Python 3.7 (tested on 64-bit Windows), replacing a string using the regex .* gives the replacement string repeated twice!
On Python 3.7.2:
>>> import re
>>> re.sub(".*", "(replacement)", "sample text")
'(replacement)(replacement)'
On Python 3.6.4:
>>> import re
>>> re.sub(".*", "(replacement)", "sample text")
'(replacement)'
On Python 2.7.5 (32 bits):
>>> import re
>>> re.sub(".*", "(replacement)", "sample text")
'(replacement)'
What is wrong? How to fix that?
Answers:
This is not a bug, but a bug fix in Python 3.7 from the commit fbb490fd2f38bd817d99c20c05121ad0168a38ee.
In regex, a non-zero-width match moves the pointer to the end of the match, so that the next match attempt, zero-width or not, continues from the position following the match. So in your example, after .* greedily matches and consumes the entire string, the pointer sits at the end of the string, which still leaves “room” for a zero-width match at that position. This is evident from the following code, which behaves the same in Python 2.7, 3.6 and 3.7:
>>> re.findall(".*", 'sample text')
['sample text', '']
So the bug fix, which is about replacement of a zero-width match right after a non-zero-width match, now correctly replaces both matches with the replacement text.
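To make the two matches visible, here is a minimal sketch that lists the span of every match the scanner finds and then runs the substitution. The variable names (text, matches, result) are only illustrative, and the doubled output assumes Python 3.7 or later.

```python
import re

text = "sample text"

# Collect every match as (start, end, matched_text).
matches = [(m.start(), m.end(), m.group()) for m in re.finditer(".*", text)]
for start, end, matched in matches:
    print(start, end, repr(matched))

# On Python 3.7+, re.sub replaces each match listed above,
# including the empty one at the very end of the string.
result = re.sub(".*", "(replacement)", text)
print(result)
```

The first match spans the whole string and the second is the empty match at the end position, which is why the replacement text shows up once per match.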
This is a common regex issue that affects many regex flavors; see these related questions:
- language-agnostic : Why do some regex engines match .* twice in a single input string?
- java : String.replaceAll(regex) makes the same replacement twice
There are several ways to fix the issue:
- Add anchors on both sides of .*: re.sub("^.*$", "(replacement)", "sample text")
- Since you want to only match a line once, add the count=1 argument: print( re.sub(".*", "(replacement)", "sample text", count=1) )
- In case you want to replace any non-empty line, replace * with +: print( re.sub(".+", "(replacement)", "sample text") )
See the Python demo:
import re
# Adding anchors:
print( re.sub("^.*$", "(replacement)", "sample text") ) # => (replacement)
# Using the count=1 argument
print( re.sub(".*", "(replacement)", "sample text", count=1) ) # => (replacement)
# If you want to replace non-empty lines:
print( re.sub(".+", "(replacement)", "sample text") ) # => (replacement)
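The three fixes agree on the sample input but are not interchangeable on other inputs. The sketch below is one way to compare them side by side; compare_fixes is a hypothetical helper written for this illustration, not part of the re module, and the behavior described assumes Python 3.7 or later with no flags passed.

```python
import re

def compare_fixes(text, repl="(replacement)"):
    """Apply each suggested fix to the same input (illustrative helper)."""
    return {
        "anchors": re.sub("^.*$", repl, text),         # whole-string match only
        "count=1": re.sub(".*", repl, text, count=1),  # stop after the first match
        "plus": re.sub(".+", repl, text),              # skip empty matches
    }

for sample in ["sample text", "", "line one\nline two"]:
    print(repr(sample), compare_fixes(sample))
```

On an empty string, the anchored and count=1 versions still insert the replacement, while .+ leaves the string empty. On a two-line string, the anchored pattern matches nothing without re.MULTILINE, count=1 replaces only the first line, and .+ replaces each line, so pick the variant that fits the input you expect.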