Python re.sub back reference not back referencing
Question:
I have the following:
<text top="52" left="20" width="383" height="15" font="0"><b>test</b></text>
and I have the following:
fileText = re.sub("<b>(.*?)</b>", "\1", fileText, flags=re.DOTALL)
Here fileText is the string I posted above. When I print fileText after running the regex replacement, I get back
<text top="52" left="20" width="383" height="15" font="0"></text>
instead of the expected
<text top="52" left="20" width="383" height="15" font="0">test</text>
I am fairly proficient with regex and I know the pattern should work. In fact, I know it matches properly, because I can see the captured text when I do a search and print out the groups. But I am new to Python and am confused as to why it's not working with back references properly.
Answers:
You need to use a raw-string here so that the backslash isn’t processed as an escape character:
>>> import re
>>> fileText = '<text top="52" left="20" width="383" height="15" font="0"><b>test</b></text>'
>>> fileText = re.sub("<b>(.*?)</b>", r"\1", fileText, flags=re.DOTALL)
>>> fileText
'<text top="52" left="20" width="383" height="15" font="0">test</text>'
>>>
Notice how "\1" was changed to r"\1". Though it is a very small change (one character), it has a big effect. See below:
>>> "\1"
'\x01'
>>> r"\1"
'\\1'
>>>
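To make the difference concrete, here is a small sketch that runs the broken call next to a few working alternatives. The names file_text and pattern are only illustrative, and the \g<1> and function forms are other options that re.sub accepts, not something the question used.

```python
import re

file_text = '<text top="52" left="20" width="383" height="15" font="0"><b>test</b></text>'
pattern = r"<b>(.*?)</b>"

# In a normal string literal, "\1" is an octal escape for the single
# character U+0001, so re.sub never receives a backslash at all.
assert "\1" == "\x01" and len("\1") == 1

# A raw string keeps the backslash: two characters, '\' and '1'.
assert len(r"\1") == 2

# Broken: the match is replaced by the invisible control character U+0001,
# which is why the printed result looks empty between the tags.
broken = re.sub(pattern, "\1", file_text, flags=re.DOTALL)

# Working alternatives, all referring to capture group 1.
fixed_raw = re.sub(pattern, r"\1", file_text, flags=re.DOTALL)        # raw string
fixed_escaped = re.sub(pattern, "\\1", file_text, flags=re.DOTALL)    # escaped backslash
fixed_group = re.sub(pattern, r"\g<1>", file_text, flags=re.DOTALL)   # explicit group syntax
fixed_func = re.sub(pattern, lambda m: m.group(1), file_text, flags=re.DOTALL)  # callable

print(repr(broken))
print(fixed_raw)
```

The repr() of the broken result shows '\x01' sitting where "test" should be. The \g<1> form is handy when the group reference is followed by a digit, since r"\10" would be read as group 10.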