What is a unicode string?
Question:
What exactly is a unicode string?
What’s the difference between a regular string and unicode string?
What is utf-8?
I’m trying to learn Python right now, and I keep hearing this buzzword. What does the code below do?
i18n Strings (Unicode)
> ustring = u'A unicode \u018e string \xf1'
> ustring
u'A unicode \u018e string \xf1'
## (ustring from above contains a unicode string)
> s = ustring.encode('utf-8')
> s
'A unicode \xc6\x8e string \xc3\xb1' ## bytes of utf-8 encoding
> t = unicode(s, 'utf-8') ## Convert bytes back to a unicode string
> t == ustring ## It's the same as the original, yay!
True
Files Unicode
import codecs
f = codecs.open('foo.txt', 'rU', 'utf-8')
for line in f:
  # here line is a *unicode* string
Answers:
Python 2 supports the str type and the unicode type. A str is a sequence of bytes, while a unicode object is a sequence of code points. The unicode object is an in-memory representation of the text: each element is not a byte but a number (usually written in hex) that selects a character in the Unicode character map. So a unicode variable has no encoding of its own; an encoding only comes into play when the code points are converted to bytes.
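To make the "number that selects a character" idea concrete, here is a minimal sketch. It assumes Python 3, where the built-in str plays the role of Python 2's unicode type; the variable names are illustrative only.

```python
# Python 3: a str is a sequence of Unicode code points.
text = 'A\u018e\xf1'

# ord() returns the code point (an integer) behind each character.
code_points = [ord(ch) for ch in text]
print(code_points)                 # [65, 398, 241]

# chr() goes the other way: from a number to the character it selects.
rebuilt = ''.join(chr(n) for n in code_points)
print(rebuilt == text)             # True
```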
Update: Python 3
In Python 3, Unicode strings are the default. The type str is a collection of Unicode code points, and the type bytes is used for representing collections of 8-bit integers (often interpreted as ASCII characters).
Here is the code from the question, updated for Python 3:
>>> my_str = 'A unicode \u018e string \xf1' # no need for "u" prefix
# the escape sequence "\u" denotes a Unicode code point (in hex)
>>> my_str
'A unicode Ǝ string ñ'
# the Unicode code points U+018E and U+00F1 were displayed
# as their corresponding glyphs
>>> my_bytes = my_str.encode('utf-8') # convert to a bytes object
>>> my_bytes
b'A unicode \xc6\x8e string \xc3\xb1'
# the "b" prefix means a bytes literal
# the escape sequence "\x" denotes a byte using its hex value
# the code points U+018E and U+00F1 were encoded as 2-byte sequences
>>> my_str2 = my_bytes.decode('utf-8') # convert back to str
>>> my_str2 == my_str
True
Working with files:
>>> f = open('foo.txt', 'r') # text mode (Unicode)
>>> # the platform's default encoding (e.g. UTF-8) is used to decode the file
>>> # to set a specific encoding, use open('foo.txt', 'r', encoding="...")
>>> for line in f:
...     pass # here line is a str object
...
>>> f = open('foo.txt', 'rb') # "b" means binary mode (bytes)
>>> for line in f:
...     pass # here line is a bytes object
Historical answer: Python 2
In Python 2, the str type was a collection of 8-bit characters (like Python 3’s bytes type). The English alphabet fits comfortably in these 8-bit characters (ASCII needs only 7 bits), but symbols such as Ω, и, ±, and ♠ are not part of ASCII.
Unicode is a standard for working with a wide range of characters. Each symbol has a code point (a number), and these code points can be encoded (converted to a sequence of bytes) using a variety of encodings.
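As a rough illustration of "one code point, several possible encodings", the sketch below encodes the same character three ways. It assumes Python 3; the choice of Latin-1 and UTF-16 (little-endian) as comparison encodings is mine, not the original answer's.

```python
ch = '\xf1'  # the character ñ, code point U+00F1

as_utf8 = ch.encode('utf-8')       # two bytes
as_latin1 = ch.encode('latin-1')   # one byte
as_utf16 = ch.encode('utf-16-le')  # two bytes, different from UTF-8
print(as_utf8, as_latin1, as_utf16)  # b'\xc3\xb1' b'\xf1' b'\xf1\x00'

# Not every encoding can represent every code point.
try:
    '\u018e'.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)  # False
```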
UTF-8 is one such encoding. Code points below U+0080 (the ASCII range) are encoded using a single byte, and higher code points are encoded as sequences of two to four bytes.
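The variable-length behaviour can be checked directly. This is a small sketch assuming Python 3; the euro sign and the emoji are extra sample characters added here for illustration and do not appear in the question.

```python
# Sample characters with increasingly high code points.
samples = ['A', '\xf1', '\u018e', '\u20ac', '\U0001f600']

for ch in samples:
    encoded = ch.encode('utf-8')
    print('U+%04X -> %d byte(s): %r' % (ord(ch), len(encoded), encoded))

lengths = [len(ch.encode('utf-8')) for ch in samples]
print(lengths)  # [1, 2, 2, 3, 4]
```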
To allow working with Unicode characters, Python 2 has a unicode type which is a collection of Unicode code points (like Python 3’s str type). The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.
When the Python interpreter displays the value of ustring, it escapes two of the characters (Ǝ and ñ) because they are not in the printable ASCII range.
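Python 3 displays such characters as glyphs by default, but its built-in ascii() produces the same kind of escaped display that Python 2 showed for unicode objects. A minimal sketch, assuming Python 3:

```python
ustring = 'A unicode \u018e string \xf1'

# ascii() escapes every non-ASCII character, much like Python 2's repr().
escaped = ascii(ustring)
print(escaped)  # 'A unicode \u018e string \xf1'
```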
The line s = ustring.encode('utf-8') encodes the Unicode string using UTF-8. This converts each code point to the appropriate byte or sequence of bytes. The result is a collection of bytes, which is returned as a str. The size of s is 22 bytes, because two of the characters have high code points and are encoded as a sequence of two bytes rather than a single byte.
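The 20-versus-22 count is easy to verify. The sketch below assumes Python 3, so the encoded result is a bytes object rather than a Python 2 str.

```python
ustring = 'A unicode \u018e string \xf1'
s = ustring.encode('utf-8')

print(len(ustring))  # 20 code points
print(len(s))        # 22 bytes
```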
When the Python interpreter displays the value of s, it escapes four bytes that are not in the printable range (\xc6, \x8e, \xc3, and \xb1). The two pairs of bytes are not treated as single characters like before because s is of type str, not unicode.
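The difference between "one character" and "two bytes" shows up when indexing. This is a sketch assuming Python 3 (str and bytes in place of Python 2's unicode and str); index 10 is simply where Ǝ sits in this particular string.

```python
ustring = 'A unicode \u018e string \xf1'
s = ustring.encode('utf-8')

print(ascii(ustring[10]))  # '\u018e'   one code point
print(s[10:12])            # b'\xc6\x8e'  the two bytes that encode it
print(s[10:12].decode('utf-8') == ustring[10])  # True

# Half of a two-byte sequence is not valid UTF-8 on its own.
try:
    s[10:11].decode('utf-8')
    half_ok = True
except UnicodeDecodeError:
    half_ok = False
print(half_ok)  # False
```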
The line t = unicode(s, 'utf-8') does the opposite of encode(). It reconstructs the original code points by looking at the bytes of s and parsing byte sequences. The result is a Unicode string.
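Decoding only recovers the original text if the right encoding is named. The sketch below, assuming Python 3, decodes the same bytes correctly as UTF-8 and then, as a deliberately wrong choice of mine, as Latin-1.

```python
s = b'A unicode \xc6\x8e string \xc3\xb1'

# Correct: the bytes were produced by UTF-8, so decode them as UTF-8.
t = s.decode('utf-8')
print(len(t))  # 20

# Wrong codec: Latin-1 maps every byte to one character, so no error is
# raised, but the multi-byte sequences turn into two characters each.
wrong = s.decode('latin-1')
print(len(wrong))         # 22
print(ascii(wrong[-2:]))  # '\xc3\xb1'
```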
The call to codecs.open() specifies utf-8 as the encoding, which tells Python to interpret the contents of the file (a collection of bytes) as a Unicode string that has been encoded using UTF-8.
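What this decoding step amounts to can be sketched as follows. The example uses io.open() as a stand-in for codecs.open(), since io.open() accepts an encoding argument in both Python 2.6+ and Python 3; the temporary file and its contents are made up for the demonstration.

```python
import io
import os
import tempfile

# Throwaway file containing the UTF-8 bytes for two short lines.
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')
with io.open(path, 'wb') as f:
    f.write(b'\xc6\x8e\n\xc3\xb1\n')

# Reading with an explicit encoding decodes the bytes as they are read.
with io.open(path, 'r', encoding='utf-8') as f:
    decoded_lines = [line for line in f]

# The same result by hand: read the raw bytes, then decode them.
with io.open(path, 'rb') as f:
    by_hand = f.read().decode('utf-8').splitlines(True)

print(decoded_lines == by_hand)  # True
```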