How to print non-ASCII characters as \uXXXX literals

Question:

# what I currently have

print('你好')

# 你好

# this is what I want

print('你好')

# \uXXXX \uXXXX

How do I do this? I want to print all non-ascii characters in strings as unicode escape literals

Asked By: AlanSTACK

Answers:

You can convert strings to a debug representation with non-ASCII, non-printable characters converted to escape sequences using the ascii() function:

As repr(), return a string containing a printable representation of an object, but escape the non-ASCII characters in the string returned by repr() using \x, \u or \U escapes.

For Unicode codepoints in the range U+0100-U+FFFF this uses \uhhhh escapes; for the Latin-1 range (U+007F-U+00FF) \xhh escapes are used instead. Note that the output qualifies as valid Python syntax to re-create the string, so quotes are included:

>>> print('你好')
你好
>>> print(ascii('你好'))
'\u4f60\u597d'
>>> print(ascii('ASCII is not changed, Latin-1 (åéîøü) is, as are all higher codepoints, such as 你好'))
'ASCII is not changed, Latin-1 (\xe5\xe9\xee\xf8\xfc) is, as are all higher codepoints, such as \u4f60\u597d'
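Since the ascii() output is a valid string literal, it can be parsed back into the original string. The following is a minimal sketch of that round trip; using ast.literal_eval as the parser is an assumption of this example, not something the answer prescribes:

```python
import ast

original = 'Latin-1 (åéîøü) and 你好'

# ascii() yields a quoted, ASCII-only string literal.
escaped = ascii(original)

# literal_eval parses the literal back without executing arbitrary code.
restored = ast.literal_eval(escaped)

print(escaped)
print(restored == original)
```

If the quotes are unwanted, they would have to be stripped before printing, at which point the output is no longer a complete literal.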

If you must have \uhhhh for everything, you’ll have to do your own conversion:

import re

def escape_unicode(t, _p=re.compile(r'[\u0080-\U0010ffff]')):
    def escape(match):
        char = ord(match.group())
        return '\\u{:04x}'.format(char) if char < 0x10000 else '\\U{:08x}'.format(char)
    return _p.sub(escape, t)

The above function does not add quotes like the ascii() function does:

>>> print(escape_unicode('你好'))
\u4f60\u597d
>>> print(escape_unicode('ASCII is not changed, Latin-1 (åéîøü) is, as are all higher codepoints, such as 你好'))
ASCII is not changed, Latin-1 (\u00e5\u00e9\u00ee\u00f8\u00fc) is, as are all higher codepoints, such as \u4f60\u597d
Answered By: Martijn Pieters

Do note that without replacing \ with \\, what you want is not reversible; e.g. you can’t know whether the actual string was '好' (one character) or '\\u597d' (six characters in the ASCII range), since both would produce \u597d as output. Martijn’s ascii() suggestion does the backslash replacement, and is reversible.
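To make the ambiguity concrete, here is a small sketch. The helper naive_escape is hypothetical and only stands in for any conversion that leaves backslashes alone. The built-in unicode_escape codec is used as one possible reversible alternative; note that it writes \xhh for Latin-1 characters and also escapes control characters such as newlines, so it is not a pure \uXXXX converter:

```python
def naive_escape(s):
    # Hypothetical helper: escape non-ASCII, leave backslashes untouched.
    return ''.join(c if ord(c) < 128 else '\\u%04x' % ord(c) for c in s)

one_char = '好'          # a single non-ASCII character
six_chars = '\\u597d'    # backslash, 'u', '5', '9', '7', 'd'

# Both inputs collapse to the same output, so the original cannot be recovered.
print(naive_escape(one_char) == naive_escape(six_chars))  # True

# The unicode_escape codec doubles backslashes, which keeps the two apart.
a = one_char.encode('unicode_escape').decode('ascii')
b = six_chars.encode('unicode_escape').decode('ascii')
print(a)  # \u597d
print(b)  # \\u597d

# Decoding with the same codec restores each original string.
print(a.encode('ascii').decode('unicode_escape') == one_char)
print(b.encode('ascii').decode('unicode_escape') == six_chars)
```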

You could just make the conversion yourself:

def unicodeescape(s):
    return ''.join(c if ord(c) < 128 else '\\u%04x' % ord(c) for c in s)

print(unicodeescape('你好'))

(Martijn’s note about characters outside the BMP still applies)
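To illustrate that caveat: with '%04x', a character such as U+1F600 is formatted as \u1f600, which Python would read back as \u1f60 followed by a literal 0. A possible variant that switches to the eight-digit \U form is sketched below; the name unicodeescape_full is made up for this example:

```python
def unicodeescape_full(s):
    """Hypothetical variant of unicodeescape() that also covers non-BMP characters."""
    def esc(c):
        code = ord(c)
        if code < 128:
            return c                      # ASCII stays as-is
        if code < 0x10000:
            return '\\u%04x' % code       # BMP: four hex digits
        return '\\U%08x' % code           # beyond the BMP: eight hex digits
    return ''.join(esc(c) for c in s)

print(unicodeescape_full('你好'))  # \u4f60\u597d
print(unicodeescape_full('😀'))    # \U0001f600
```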

If you want to do this to everything your program outputs, and trying to remember to pass everything through a conversion function doesn’t seem like your idea of a good time, you could also try something like this:

import codecs, sys

def unicodeescapereplace(error):
    if isinstance(error, UnicodeEncodeError):
        s = error.object[error.start:error.end]
        repl = ''.join('\\u%04x' % ord(c) for c in s)
        return (repl, error.end)
    raise error

codecs.register_error('unicodeescapereplace', unicodeescapereplace)
sys.stdout = codecs.getwriter('ascii')(sys.stdout.buffer, 'unicodeescapereplace')

print('你好')

This creates a custom encoding error handler, which handles UnicodeEncodeErrors by replacing the offending character with a unicode escape. You can use it like '你好'.encode('ascii', 'unicodeescapereplace'), or like the example above, replace the stdout with one that uses it automatically for all encoding.
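For comparison, Python also ships a built-in 'backslashreplace' error handler. The sketch below uses the handler through str.encode() without touching sys.stdout and puts both results side by side; the handler is repeated only so the snippet runs on its own, and the sample text is made up:

```python
import codecs

def unicodeescapereplace(error):
    # Same handler as above, repeated so this snippet is self-contained.
    if isinstance(error, UnicodeEncodeError):
        s = error.object[error.start:error.end]
        return (''.join('\\u%04x' % ord(c) for c in s), error.end)
    raise error

codecs.register_error('unicodeescapereplace', unicodeescapereplace)

text = 'é and 你好'

# Custom handler: every non-ASCII character becomes \uXXXX.
custom = text.encode('ascii', 'unicodeescapereplace').decode('ascii')

# Built-in handler: \xhh for code points up to U+00FF, \uXXXX above that.
builtin = text.encode('ascii', 'backslashreplace').decode('ascii')

print(custom)   # \u00e9 and \u4f60\u597d
print(builtin)  # \xe9 and \u4f60\u597d
```

If \xhh for Latin-1 characters is acceptable, the built-in handler avoids registering anything.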

Answered By: Aleksi Torhamo

The normal representation is obtained by using the ascii builtin, as explained by Martijn Pieters.

If you really want to consistently print \u escapes, you can do it by hand:

t = 'ASCII is not changed, Latin-1 (åéîøü) is, as are all higher codepoints, such as 你好'
disp = u = "'" + ''.join([c if (ord(c) < 128) else r'\u%04x' % (ord(c),) for c in t ]) + "'"
print(disp)
print(eval(disp))

gives as expected:

'ASCII is not changed, Latin-1 (\u00e5\u00e9\u00ee\u00f8\u00fc) is, as are all higher codepoints, such as \u4f60\u597d'
ASCII is not changed, Latin-1 (åéîøü) is, as are all higher codepoints, such as 你好

NB: I do know that eval is evil, but in this particular use case I know that the inner string contains no ' and that it is enclosed in ', so it cannot be more than a mere conversion of encoded characters. I would never do that on an external string without at least testing "'" in t.

NB2: this method cannot process correctly characters whose code is greater than 0xffff – it would need another if else

Answered By: Serge Ballesta