How to print Unicode like "\u{variable}" in Python 2.7?
Question:
For example, I can print a Unicode symbol like:
print u'\u00E0'
Or
a = u'\u00E0'
print a
But it looks like I can’t do something like this:
a = '\u00E0'
print someFunctionToDisplayTheCharacterRepresentedByThisCodePoint(a)
The main use case will be in loops. I have a list of Unicode code points and I wish to display them on the console. Something like:
with open("someFileWithAListOfUnicodeCodePoints") as uniCodeFile:
    for codePoint in uniCodeFile:
        print codePoint  # I want the console to display the Unicode character here
The file has a list of Unicode code points. For example:
2109
00B0
00E4
1F1E6
The loop should output:
℉
°
ä
🇦
Any help will be appreciated!
Answers:
This is probably not a great way, but it’s a start:
>>> import struct
>>> x = '00e4'
>>> print unicode(struct.pack("!I", int(x, 16)), 'utf_32_be')
ä
First, we get the integer represented by the hexadecimal string x. We pack that into a byte string, which we can then decode using the utf_32_be encoding.
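For reference, the same pack-then-decode idea works in Python 3, where decoding a bytes object yields str directly. A minimal sketch:

```python
import struct

# Pack the code point into 4 big-endian bytes, then decode as UTF-32-BE.
x = '00e4'
char = struct.pack("!I", int(x, 16)).decode('utf_32_be')
print(char)  # ä
```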
Since you are doing this a lot, you can precompile the struct:
int2bytes = struct.Struct("!I").pack
with open("someFileWithAListOfUnicodeCodePoints") as fh:
    for code_point in fh:
        print unicode(int2bytes(int(code_point, 16)), 'utf_32_be')
If you think it’s clearer, you can also use the decode method instead of the unicode type directly:
>>> print int2bytes(int('00e4', 16)).decode('utf_32_be')
ä
Python 3 added a to_bytes method to the int class that lets you bypass the struct module:
>>> str(int('00e4', 16).to_bytes(4, 'big'), 'utf_32_be')
'ä'
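Putting the to_bytes version together, a Python 3 sketch might look like the following (the hex strings below stand in for lines read from the question’s file):

```python
# Sketch: these hex strings stand in for lines read from the
# question's code-point file.
lines = ["2109", "00B0", "00E4", "1F1E6"]
chars = []
for line in lines:
    cp = int(line.strip(), 16)
    # 4 big-endian bytes per code point, decoded as UTF-32-BE.
    chars.append(str(cp.to_bytes(4, 'big'), 'utf_32_be'))
    print(chars[-1])
```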
These are Unicode code points, but they lack the \u prefix of a Python unicode-escape. So, just put it in:
with open("someFileWithAListOfUnicodeCodePoints", "rb") as uniCodeFile:
    for codePoint in uniCodeFile:
        print ("\u" + codePoint.strip()).decode("unicode-escape")
Whether this works on a given system depends on the console’s encoding. If it’s a Windows code page and the characters are not in its range, you’ll still get funky errors.
In Python 3 that would be b"\u".
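A Python 3 sketch of the same unicode-escape idea, with one caveat: the \u escape consumes exactly four hex digits, so code points above U+FFFF (like 1F1E6 in the question’s file) need the eight-digit \U form instead. The helper name below is illustrative, not from the original answer:

```python
# Hypothetical helper: build a \u or \U escape from a hex code-point
# string, then decode it with the unicode_escape codec.
def escape_decode(cp):
    cp = cp.strip()
    if len(cp) <= 4:
        escaped = b"\\u" + cp.rjust(4, "0").encode("ascii")
    else:
        escaped = b"\\U" + cp.rjust(8, "0").encode("ascii")
    return escaped.decode("unicode_escape")

print(escape_decode("00E4"))   # ä
print(escape_decode("1F1E6"))
```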
You want print unichr(int('00E0', 16)). Convert the hex string to an integer and print the character at that code point.
Caveat: On Windows, Python 2 is a narrow Unicode build, so unichr fails for code points > U+FFFF.
Solution: Use Python 3.3+ and print(chr(int(line, 16))).
In all cases you’ll still need to use a font that supports the glyphs for the codepoints.
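As a sketch, the chr approach applied to the question’s sample values in Python 3:

```python
# chr in Python 3 covers the full Unicode range, including code points
# above U+FFFF that unichr rejects on narrow Python 2 builds.
for hex_cp in ["2109", "00B0", "00E4", "1F1E6"]:
    print(chr(int(hex_cp, 16)))
```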