how to get re.sub to add a single backslash between group placeholders

Question:

When trying to build a regex to escape "s in a string, I run into an issue where I can’t get the right number of backslashes to produce the desired (\") output.

data="""    {
    value1: "blah",
    value2: 'foo<a href="example.com">bar</a>',
}"""

The pattern works fine with another character (e.g. \1!\2 -> !"):

>>> re.sub(r'(.*?)(".*?)',r'\1!\2',data, re.MULTILINE)
'    {\n        value1: !"blah!",\n        value2: \'foo<a href=!"example.com!">bar</a>\',\n    }'

but backslashes alone don’t seem to escape as expected:

>>> re.sub(r'(.*?)(".*?)',r'\1\\2',data, re.MULTILINE)
"    {\n        value1: \\2blah\\2,\n        value2: 'foo<a href=\\2example.com\\2>bar</a>',\n    }"

>>> re.sub(r'(.*?)(".*?)',r'\1\\\2',data, re.MULTILINE)
'    {\n        value1: \\"blah\\",\n        value2: \'foo<a href=\\"example.com\\">bar</a>\',\n    }'

>>> re.sub(r'(.*?)(".*?)',r'\1\\\\2',data, re.MULTILINE)
"    {\n        value1: \\\\2blah\\\\2,\n        value2: 'foo<a href=\\\\2example.com\\\\2>bar</a>',\n    }"

>>> re.sub(r'(.*?)(".*?)',r'\1\\\\\2',data, re.MULTILINE)
'    {\n        value1: \\\\"blah\\\\",\n        value2: \'foo<a href=\\\\"example.com\\\\">bar</a>\',\n    }'

and without raw strings:

>>> re.sub(r'(.*?)(".*?)','\\1!\\2',data, re.MULTILINE)
'    {\n        value1: !"blah!",\n        value2: \'foo<a href=!"example.com!">bar</a>\',\n    }'

>>> re.sub(r'(.*?)(".*?)','\\1\\\2',data, re.MULTILINE)
"    {\n        value1: \\\x02blah\\\x02,\n        value2: 'foo<a href=\\\x02example.com\\\x02>bar</a>',\n    }"

>>> re.sub(r'(.*?)(".*?)','\\1\\\\2',data, re.MULTILINE)
"    {\n        value1: \\2blah\\2,\n        value2: 'foo<a href=\\2example.com\\2>bar</a>',\n    }"

>>> re.sub(r'(.*?)(".*?)','\\1\\\\\2',data, re.MULTILINE)
"    {\n        value1: \\\x02blah\\\x02,\n        value2: 'foo<a href=\\\x02example.com\\\x02>bar</a>',\n    }"

>>> re.sub(r'(.*?)(".*?)','\\1\\\\\\2',data, re.MULTILINE)
'    {\n        value1: \\"blah\\",\n        value2: \'foo<a href=\\"example.com\\">bar</a>\',\n    }'

There will always be either too many backslashes in the result (using an even number in sub), or the group’s backslash (\2) gets escaped, leaving only the group number in the output.

I think I need something akin to bash’s ${varName}PM, where without the curly-braces $varNamePM would look for a variable named varNamePM instead of concatenating the content of varName with the string PM.

(output also seems to be the same without re.MULTILINE)

(using \g<1> to specify the capture groups also didn’t help. ref: https://stackoverflow.com/a/5984688/10761353)

UPDATE:
Per @marcel-wilson’s answer, here’s the functional result:

>>> res = re.sub(r'(.*?)(".*?)',r'\1\\\2',data, re.MULTILINE)
>>> res
'    {\n        value1: \\"blah\\",\n        value2: \'foo<a href=\\"example.com\\">bar</a>\',\n    }'
>>> print(res)
    {
        value1: \"blah\",
        value2: 'foo<a href=\"example.com\">bar</a>',
    }
[ manually replace single- -> dbl-quotes & remove trailing `,` on value2 ]
>>> res2
'    {\n        "value1": "blah",\n        "value2": "foo<a href=\\"example.com\\">bar</a>"\n    }'
>>> print(res2)
    {
        "value1": "blah",
        "value2": "foo<a href=\"example.com\">bar</a>"
    }
>>> json.loads(res2)
{'value1': 'blah', 'value2': 'foo<a href="example.com">bar</a>'}
Asked By: Adam Smooch


Answers:

I think it’s important to point out there is a fundamental difference between how a string is represented vs how it is printed.

When you run re.sub() in the console, the output on screen shows you the repr() of the returned string, not the string as it would be printed.

A good way to see the difference:

>>> x = re.sub(r'(.*?)(".*?)',r'\1\\\2',data, re.MULTILINE)
>>> x
'    {\n    value1: \\"blah\\",\n    value2: \'foo<a href=\\"example.com\\">bar</a>\',\n}'
>>> print(x)
    {
    value1: \"blah\",
    value2: 'foo<a href=\"example.com\">bar</a>',
}

Notice the PRINTED string has the right number of backslashes in front of the double quotes.
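
One way to confirm this without relying on how the console displays the value is to count the characters directly. The following is a minimal sketch, assuming the data from the question and the replacement template r'\1\\\2'. Passing flags by keyword is a choice made for this sketch, because the fourth positional parameter of re.sub is count, not flags.

```python
import re

# Sample input, assumed to match the data in the question.
data = '''    {
    value1: "blah",
    value2: 'foo<a href="example.com">bar</a>',
}'''

# In the raw replacement template: \1 is group 1, \\ is ONE literal
# backslash, and \2 is group 2 (the double quote itself).
# flags is passed by keyword; the fourth positional parameter is count.
x = re.sub(r'(.*?)(".*?)', r'\1\\\2', data, flags=re.MULTILINE)

print(x.count('"'))    # 4 double quotes in the result
print(x.count('\\'))   # 4 backslashes: exactly one per double quote
print(x)               # shows the string as it really is
```

If the two counts agree, each double quote has exactly one backslash in front of it, whatever the console echo appears to show.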

explanation

The difference is between str() and repr().

repr() shows you the "code equivalent" of the string. If you were to directly copy and paste it into your script it would create the string properly.

str() shows you how the string would look when printing it.
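
A small illustration of that difference, as a sketch with a made-up string s (not taken from the question's data). It uses ast.literal_eval to show that the repr() text really is a valid source literal for the same string.

```python
import ast

# In source code, each \\ stands for ONE backslash in the actual string.
s = 'href=\\"example.com\\"'

print(len(s))    # 20 characters, two of them backslashes
print(str(s))    # href=\"example.com\"
print(repr(s))   # 'href=\\"example.com\\"'

# Pasting the repr() text back into Python rebuilds the identical string.
round_trip = ast.literal_eval(repr(s))
print(round_trip == s)   # True
```

The doubled backslashes only exist in the repr() text; the string itself holds a single backslash before each quote.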

I think what is tripping you up is that when you evaluate an expression in the console, it effectively does the following without telling you:

>>> x
# is the equivalent of 
>>> print(repr(x))
# but not at all the same thing as 
>>> print(x)
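
The equivalence can be checked outside an interactive session. This sketch assumes the standard CPython interpreter, where a bare expression at the prompt is handed to sys.displayhook; it calls the default hook (sys.__displayhook__) by hand and captures what it writes. The string x here is a made-up sample.

```python
import contextlib
import io
import sys

x = 'value1: \\"blah\\"'   # one backslash before each quote in memory

# What the console writes when you type a bare `x` at the prompt.
# (The default hook also stores the value in builtins._ as a side effect.)
shown = io.StringIO()
with contextlib.redirect_stdout(shown):
    sys.__displayhook__(x)

# What print(x) writes.
printed = io.StringIO()
with contextlib.redirect_stdout(printed):
    print(x)

print(shown.getvalue() == repr(x) + '\n')   # True
print(printed.getvalue() == x + '\n')       # True
```
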
Answered By: Marcel Wilson