How to print non-ASCII characters as \uXXXX literals
Question:
# what I currently have
print('你好')
# 你好
# this is what I want
print('你好')
# \uXXXX \uXXXX
How do I do this? I want to print all non-ascii characters in strings as unicode escape literals
Answers:
You can convert strings to a debug representation with non-ASCII, non-printable characters converted to escape sequences using the ascii() function:
As repr(), return a string containing a printable representation of an object, but escape the non-ASCII characters in the string returned by repr() using \x, \u or \U escapes.
For Unicode codepoints in the range U+0100-U+FFFF this uses \uhhhh escapes; for the Latin-1 range (U+007F-U+00FF) \xhh escapes are used instead. Note that the output qualifies as valid Python syntax to re-create the string, so quotes are included:
>>> print('你好')
你好
>>> print(ascii('你好'))
'\u4f60\u597d'
>>> print(ascii('ASCII is not changed, Latin-1 (åéîøü) is, as are all higher codepoints, such as 你好'))
'ASCII is not changed, Latin-1 (\xe5\xe9\xee\xf8\xfc) is, as are all higher codepoints, such as \u4f60\u597d'
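As a small illustration of the cases mentioned above, the following sketch runs ascii() over a few sample strings. The sample labels and the `escaped` dict are made up for this example; the non-BMP and backslash cases are not in the original answer and are included only to show how ascii() treats them.

```python
# Sample inputs chosen for illustration only.
samples = {
    'BMP': '你好',
    'Latin-1': 'é',
    'non-BMP': '\U0001F600',   # a codepoint above U+FFFF
    'backslash': 'a\\b',       # the three characters a, \, b
}

# ascii() returns a quoted, pure-ASCII Python literal for each string.
escaped = {label: ascii(s) for label, s in samples.items()}

for label, text in escaped.items():
    print(label, text)
```

Codepoints above U+FFFF come out as \Uhhhhhhhh, and a literal backslash is doubled, which is what keeps the output unambiguous.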
If you must have \uhhhh for everything, you’ll have to do your own conversion:
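import re
def escape_unicode(t, _p=re.compile(r'[\u0080-\U0010ffff]')):
    # _p matches every character outside the ASCII range
    def escape(match):
        char = ord(match.group())
        # \uhhhh within the BMP, \Uhhhhhhhh for anything above U+FFFF
        return '\\u{:04x}'.format(char) if char < 0x10000 else '\\U{:08x}'.format(char)
    return _p.sub(escape, t)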
The above function does not add quotes like the ascii() function does:
>>> print(escape_unicode('你好'))
\u4f60\u597d
>>> print(escape_unicode('ASCII is not changed, Latin-1 (åéîøü) is, as are all higher codepoints, such as 你好'))
ASCII is not changed, Latin-1 (\u00e5\u00e9\u00ee\u00f8\u00fc) is, as are all higher codepoints, such as \u4f60\u597d
Do note that without replacing \ with \\, what you want is not reversible; e.g. you can’t know whether the actual string was '好' (one character) or '\\u597d' (6 characters in the ASCII range), since both would produce \u597d as output. Martijn’s suggestion of ascii() does the backslash replacement, and is reversible.
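To make the reversibility point concrete, here is a minimal sketch that doubles literal backslashes before escaping. The names escape_reversible and unescape are hypothetical helpers, not part of any answer above, and decoding with the unicode_escape codec is an assumption that holds only because the escaped text is pure ASCII.

```python
def escape_reversible(s):
    # Hypothetical helper: double literal backslashes, escape non-ASCII.
    out = []
    for c in s:
        cp = ord(c)
        if c == '\\':
            out.append('\\\\')            # \ becomes \\
        elif cp < 128:
            out.append(c)                 # other ASCII passes through
        elif cp < 0x10000:
            out.append('\\u%04x' % cp)    # BMP: \uhhhh
        else:
            out.append('\\U%08x' % cp)    # above U+FFFF: \Uhhhhhhhh
    return ''.join(out)

def unescape(s):
    # The escaped text is pure ASCII, so unicode_escape can undo it.
    return s.encode('ascii').decode('unicode_escape')

one_char = '好'          # a single character
six_chars = '\\u597d'    # backslash, u, 5, 9, 7, d

print(escape_reversible(one_char))    # \u597d
print(escape_reversible(six_chars))   # \\u597d
```

With the backslash doubled, the two inputs produce different output, so each can be recovered with unescape().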
You could just make the conversion yourself:
def unicodeescape(s):
    # keep ASCII as-is, write everything else as \uhhhh
    return ''.join(c if ord(c) < 128 else '\\u%04x' % ord(c) for c in s)
print(unicodeescape('你好'))
(Martijn’s note about characters outside the BMP still applies)
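The sketch below shows why the BMP caveat matters for this function, and one possible way around it. unicodeescape_wide is a hypothetical variant, not part of the answer; the re-parse step uses the unicode_escape codec purely to demonstrate how the five-digit output would be misread.

```python
def unicodeescape(s):
    # The function from the answer above.
    return ''.join(c if ord(c) < 128 else '\\u%04x' % ord(c) for c in s)

def unicodeescape_wide(s):
    # Hypothetical variant: switch to \Uhhhhhhhh above U+FFFF.
    return ''.join(
        c if ord(c) < 128
        else ('\\u%04x' if ord(c) < 0x10000 else '\\U%08x') % ord(c)
        for c in s
    )

emoji = '\U0001F600'

narrow = unicodeescape(emoji)
print(narrow)                  # \u1f600 (five hex digits)

# Read back as a Python escape, this is U+1F60 followed by the digit 0.
reparsed = narrow.encode('ascii').decode('unicode_escape')
print(len(reparsed))           # 2

wide = unicodeescape_wide(emoji)
print(wide)                    # \U0001f600
```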
If you want to do this to everything your program outputs, and trying to remember to pass everything through a conversion function doesn’t seem like your idea of a good time, you could also try something like this:
import codecs, sys
def unicodeescapereplace(error):
    if isinstance(error, UnicodeEncodeError):
        # the slice holds the run of characters the codec could not encode
        s = error.object[error.start:error.end]
        repl = ''.join('\\u%04x' % ord(c) for c in s)
        return (repl, error.end)
    raise error
codecs.register_error('unicodeescapereplace', unicodeescapereplace)
# wrap the binary stdout so every write goes through the handler
sys.stdout = codecs.getwriter('ascii')(sys.stdout.buffer, 'unicodeescapereplace')
print('你好')
This creates a custom encoding error handler, which handles UnicodeEncodeErrors by replacing the offending character with a Unicode escape. You can use it like '你好'.encode('ascii', 'unicodeescapereplace'), or like the example above, replace the stdout with one that uses it automatically for all encoding.
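For comparison, here is a sketch of the same idea used only through str.encode(), next to Python's built-in 'backslashreplace' error handler. The handler name 'uescape_sketch' and the function uescape_errors are made up for this example, and the \U branch for codepoints above U+FFFF is an addition that the handler above does not have.

```python
import codecs

def uescape_errors(error):
    # Hypothetical handler: \uhhhh in the BMP, \Uhhhhhhhh above it.
    if not isinstance(error, UnicodeEncodeError):
        raise error
    chunk = error.object[error.start:error.end]
    repl = ''.join(
        ('\\u%04x' if ord(c) < 0x10000 else '\\U%08x') % ord(c)
        for c in chunk
    )
    return (repl, error.end)

codecs.register_error('uescape_sketch', uescape_errors)

text = 'åé 你好 \U0001F600'

# Custom handler: uniform \u escapes, including the Latin-1 range.
encoded = text.encode('ascii', 'uescape_sketch')
print(encoded.decode('ascii'))

# Built-in handler: same idea, but Latin-1 characters become \xhh.
builtin = text.encode('ascii', 'backslashreplace')
print(builtin.decode('ascii'))
```

If \xhh for the Latin-1 range is acceptable, the built-in handler avoids registering anything at all.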
The normal representation is obtained by using the ascii builtin, as explained by Martijn Pieters.
If you really want to consistently print \u escapes, you can do it by hand:
t = 'ASCII is not changed, Latin-1 (åéîøü) is, as are all higher codepoints, such as 你好'
disp = u = "'" + ''.join([c if (ord(c) < 128) else r'\u%04x' % (ord(c),) for c in t ]) + "'"
print(disp)
print(eval(disp))
gives as expected:
'ASCII is not changed, Latin-1 (\u00e5\u00e9\u00ee\u00f8\u00fc) is, as are all higher codepoints, such as \u4f60\u597d'
ASCII is not changed, Latin-1 (åéîøü) is, as are all higher codepoints, such as 你好
NB: I do know that eval is evil, but in this particular use case I know that the inner string contains no ' and that it is enclosed in ', so it cannot be more than a mere conversion of encoded characters. I would never do that on an external string without at least testing whether it contains a quote ("'" in t) …
NB2: this method cannot correctly process characters whose code is greater than 0xffff – it would need another if else…
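One possible way to sidestep both notes is to borrow the json module, sketched below as an alternative rather than as what this answer does. json.dumps() with its default ensure_ascii=True writes every non-ASCII character as \uhhhh, using a surrogate pair above 0xffff, and json.loads() reverses it without eval. Note that the result is a JSON string literal, not a Python one: it is wrapped in double quotes, and ", \ and control characters are escaped as well.

```python
import json

t = 'Latin-1 (åéîøü), BMP (你好), non-BMP (\U0001F600)'

# ensure_ascii=True is the default, so the result is pure ASCII.
literal = json.dumps(t)
print(literal)

# Strip the surrounding double quotes if only the body is wanted.
print(literal[1:-1])

# Reverse the conversion without eval.
restored = json.loads(literal)
print(restored == t)
```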