How to un-escape a backslash-escaped string?
Question:
Suppose I have a string which is a backslash-escaped version of another string. Is there an easy way, in Python, to unescape the string? I could, for example, do:
>>> escaped_str = '"Hello,\\nworld!"'
>>> raw_str = eval(escaped_str)
>>> print raw_str
Hello,
world!
>>>
However, that involves passing a (possibly untrusted) string to eval(), which is a security risk. Is there a function in the standard library that takes a string and produces a string with no security implications?
Answers:
>>> print '"Hello,\\nworld!"'.decode('string_escape')
"Hello,
world!"
You can use ast.literal_eval, which is safe:
Safely evaluate an expression node or a string containing a Python
expression. The string or node provided may only consist of the
following Python literal structures: strings, numbers, tuples, lists,
dicts, booleans, and None.
Like this:
>>> import ast
>>> escaped_str = '"Hello,\\nworld!"'
>>> print ast.literal_eval(escaped_str)
Hello,
world!
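For readers on Python 3, here is a minimal sketch of the same idea. It assumes, as in the question, that the input already carries its own surrounding quotes. The second input is a made-up example of a non-literal expression, included only to show that literal_eval rejects it instead of running it.

```python
import ast

# Backslash followed by "n": the input holds an escape sequence, not a real newline.
escaped_str = '"Hello,\\nworld!"'

unescaped = ast.literal_eval(escaped_str)
print(unescaped)

# Hypothetical hostile input: a function call is not a literal, so it is
# rejected with ValueError rather than executed.
try:
    ast.literal_eval('__import__("os").getcwd()')
    rejected = False
except ValueError:
    rejected = True
print("rejected:", rejected)
```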
In Python 3, str objects don't have a decode method and you have to use a bytes object. ChristopheD's answer covers Python 2.
# create a `bytes` object from a `str`
my_str = "Hello,\\nworld"
# (pick an encoding suitable for your str, e.g. 'latin1')
my_bytes = my_str.encode("utf-8")
# or directly
my_bytes = b"Hello,\\nworld"
print(my_bytes.decode("unicode_escape"))
# Hello,
# world
All of the answers above break on general Unicode strings. The following works for Python 3 in all cases, as far as I can tell:
from codecs import encode, decode
sample = u'mon€y\\nröcks'
result = decode(encode(sample, 'latin-1', 'backslashreplace'), 'unicode-escape')
print(result)
In recent Python versions, this also works without the import:
sample = u'mon€y\\nröcks'
result = sample.encode('latin-1', 'backslashreplace').decode('unicode-escape')
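To make the difference concrete, here is a small sketch comparing a UTF-8 round trip with the Latin-1 plus backslashreplace round trip on the same input. The variable names mangled and intact are only illustrative.

```python
sample = 'mon€y\\nröcks'  # literal backslash + "n", plus non-ASCII characters

# The unicode_escape decoder reads each byte as Latin-1, so the multi-byte
# UTF-8 sequences for "€" and "ö" come out as mojibake.
mangled = sample.encode('utf-8').decode('unicode_escape')

# Latin-1 with backslashreplace keeps "ö" as a single byte and rewrites
# "€" as the ASCII escape \u20ac, which the decoder then restores.
intact = sample.encode('latin-1', 'backslashreplace').decode('unicode_escape')

print(repr(mangled))
print(repr(intact))
```

In both cases the \n escape is turned into a real newline; only the non-ASCII characters differ.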
As suggested by obataku, you can also use the literal_eval function from the ast module, like so:
import ast
sample = u'mon€y\\nröcks'
print(ast.literal_eval(f'"{sample}"'))
Or like this when your string really contains a string literal (including the quotes):
import ast
sample = u'"mon€y\\nröcks"'
print(ast.literal_eval(sample))
However, if you are uncertain whether the input string uses double or single quotes as delimiters, or if you cannot assume it is properly escaped at all, then literal_eval may raise a SyntaxError, while the encode/decode method will still work.
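A short sketch of that failure mode, using a hypothetical input that contains unescaped double quotes. Wrapping it in quotes no longer yields a valid string literal, so literal_eval fails, while the codec round trip does not care about quoting.

```python
import ast

# Hypothetical input: unescaped double quotes plus a backslash-n escape.
sample = 'say "hi"\\n'

try:
    via_ast = ast.literal_eval(f'"{sample}"')
except SyntaxError:
    via_ast = None  # the wrapped text is not a valid Python string literal

# The encode/decode approach ignores quote characters entirely.
via_codec = sample.encode('latin-1', 'backslashreplace').decode('unicode-escape')

print(via_ast)
print(repr(via_codec))
```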
For Python 3, consider:
my_string.encode('raw_unicode_escape').decode('unicode_escape')
The 'raw_unicode_escape' codec encodes to latin1, but first replaces all other Unicode code points with an escaped '\uXXXX' or '\UXXXXXXXX' form. Importantly, it differs from the normal 'unicode_escape' codec in that it does not touch existing backslashes.
So when the normal 'unicode_escape' decoder is applied, both the newly-escaped code points and the originally-escaped elements are treated equally, and the result is an unescaped native Unicode string.
(The 'raw_unicode_escape' decoder appears to pay attention only to the '\uXXXX' and '\UXXXXXXXX' forms, ignoring all other escapes.)
Documentation:
https://docs.python.org/3/library/codecs.html?highlight=codecs#text-encodings
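The two steps can be inspected separately. The sketch below uses an illustrative sample string and shows the intermediate bytes, then the final result, and finally what the 'raw_unicode_escape' decoder alone would do with the same bytes.

```python
my_string = 'mon€y\\nröcks'  # illustrative input with a literal backslash + "n"

# Step 1: "ö" becomes a single Latin-1 byte, "€" becomes the escape \u20ac,
# and the existing backslash is left alone.
step1 = my_string.encode('raw_unicode_escape')
print(step1)  # b'mon\\u20acy\\nr\xf6cks'

# Step 2: the normal decoder resolves both \u20ac and \n.
result = step1.decode('unicode_escape')
print(repr(result))

# For comparison: the raw decoder resolves \u20ac but leaves \n as two characters.
partial = step1.decode('raw_unicode_escape')
print(repr(partial))
```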
A custom string parser to decode only some backslash escapes, in this case \" and \':
def backslash_decode(src):
    "decode backslash-escapes"
    slashes = 0 # count backslashes
    dst = ""
    for loc in range(0, len(src)):
        char = src[loc]
        if char == "\\":
            slashes += 1
            if slashes == 2:
                dst += char # decode backslash
                slashes = 0
        elif slashes == 0:
            dst += char # normal char
        else: # slashes == 1
            if char == '"':
                dst += char # decode double-quote
            elif char == "'":
                dst += char # decode single-quote
            else:
                dst += "\\" + char # keep backslash-escapes like \n or \t
            slashes = 0
    return dst
src = "a" + "\\\\" + r"\'" + r'\"' + r"\n" + r"\t" + r"\x" + "z" # input
exp = "a" + "\\" + "'" + '"' + r"\n" + r"\t" + r"\x" + "z" # expected output
res = backslash_decode(src)
print(res)
assert res == exp
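If a regular expression is acceptable, the same selective decoding can be sketched more compactly. This is an alternative to the loop, not part of the original answer, and the name backslash_decode_re is hypothetical.

```python
import re

def backslash_decode_re(src):
    "decode only the escapes for backslash, double quote and single quote"
    # A backslash followed by \, " or ' is replaced by that character alone;
    # any other escape, such as \n or \t, is left untouched.
    return re.sub(r'\\([\\"\'])', r'\1', src)

src = "a" + "\\\\" + r"\'" + r'\"' + r"\n" + "z"  # input
exp = "a" + "\\" + "'" + '"' + r"\n" + "z"        # expected output

res = backslash_decode_re(src)
print(res)
```

Because re.sub scans left to right and consumes both characters of each match, an escaped backslash followed by n stays a backslash plus n rather than turning into a newline escape.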