Convert "\x"-escaped string into readable string in Python
Question:
Is there a way to convert a \x-escaped string like "\xe8\xaa\x9e\xe8\xa8\x80" into its readable form "語言"?
>>> a = "\xe8\xaa\x9e\xe8\xa8\x80"
>>> print(a)
\xe8\xaa\x9e\xe8\xa8\x80
I am aware that there is a similar question here, but it seems the solution is only for latin characters. How can I convert this form of string into readable CJK characters?
Answers:
Decode it first using ‘unicode-escape’, then as ‘utf8’:
a = "\xe8\xaa\x9e\xe8\xa8\x80"
decoded = a.encode('latin1').decode('unicode_escape').encode('latin1').decode('utf8')
print(decoded)
# 語言
Note that since we can only decode bytes objects, we need to transparently encode it in between, using ‘latin1’.
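To make the chain clearer, here is the same pipeline unpacked step by step, using a raw string so the backslashes are literal characters (as they would be if the text were read from a file):

```python
# A raw string: the backslashes are literal text, not escape sequences.
a = r"\xe8\xaa\x9e\xe8\xa8\x80"

step1 = a.encode('latin1')              # bytes whose content is still the literal escapes
step2 = step1.decode('unicode_escape')  # str of code points U+00E8, U+00AA, U+009E, ...
step3 = step2.encode('latin1')          # the real UTF-8 bytes b'\xe8\xaa\x9e\xe8\xa8\x80'
print(step3.decode('utf8'))             # 語言
```

The latin1 round trips work because latin1 maps each code point 0–255 to the byte of the same value, so no information is lost in either direction.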
Starting with string a, which appears to follow Python's hex escaping rules, you can decode it with codecs.escape_decode, which returns a bytes object plus the number of characters decoded.
>>> a = "\xe8\xaa\x9e\xe8\xa8\x80"
>>> import codecs
>>> codecs.escape_decode(a)
(b'\xe8\xaa\x9e\xe8\xa8\x80', 24)
You don’t need the length here, so just get item 0. Now it’s time for some guessing. Assuming that this string actually represents a UTF-8 encoding, you now have a bytes object that you can decode:
>>> codecs.escape_decode(a)[0].decode('utf-8')
'語言'
If the underlying encoding was different (say, a Windows CJK code page), you’d have to decode with its decoder.
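For example, if the escapes had come from a Big5-encoded source instead of UTF-8, the same approach applies with a different final decode. This sketch round-trips the text through Big5 to build such input, since the example bytes in the question happen to be UTF-8:

```python
import codecs

# Hypothetical input: '語言' escaped from a Big5-encoded source rather than UTF-8.
big5_escaped = ''.join(f'\\x{b:02x}' for b in '語言'.encode('big5'))
raw = codecs.escape_decode(big5_escaped)[0]
print(raw.decode('big5'))  # 語言 -- decoding these bytes as UTF-8 would not work
```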
Text like this could make a valid Python bytes literal. Assuming we don’t have to worry about invalid input, we can simply construct a string that looks like the corresponding source code, and use ast.literal_eval to interpret it that way (this is safe, unlike using eval). Finally we decode the resulting bytes as UTF-8. Thus:
>>> a = "\xe8\xaa\x9e\xe8\xa8\x80"
>>> import ast
>>> ast.literal_eval(f"b'{a}'")
b'\xe8\xaa\x9e\xe8\xa8\x80'
>>> ast.literal_eval(f"b'{a}'").decode('utf-8')
'語言'
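This can be wrapped in a small helper (the function name is my own; it assumes the input consists only of escape sequences, with no single quotes, stray backslashes, or newlines of its own):

```python
import ast

def unescape_utf8(s: str) -> str:
    # Build a bytes literal and let the parser interpret the \xNN escapes.
    # Assumes s contains no single quotes or newlines of its own.
    return ast.literal_eval(f"b'{s}'").decode('utf-8')

print(unescape_utf8(r"\xe8\xaa\x9e\xe8\xa8\x80"))  # 語言
```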
Such a codec is missing from the stdlib. My package all-escapes registers a codec which can be used:
>>> a = "\xe8\xaa\x9e\xe8\xa8\x80"
>>> a.encode('all-escapes').decode()
'語言'