What is a unicode string?

Question:

What exactly is a unicode string?

What’s the difference between a regular string and unicode string?

What is utf-8?

I’m trying to learn Python right now, and I keep hearing this buzzword. What does the code below do?

i18n Strings (Unicode)

> ustring = u'A unicode u018e string xf1'
> ustring
u'A unicode u018e string xf1'

## (ustring from above contains a unicode string)
> s = ustring.encode('utf-8')
> s
'A unicode xc6x8e string xc3xb1'  ## bytes of utf-8 encoding
> t = unicode(s, 'utf-8')             ## Convert bytes back to a unicode string
> t == ustring                      ## It's the same as the original, yay!
True

Files Unicode

import codecs

f = codecs.open('foo.txt', 'rU', 'utf-8')
for line in f:
# here line is a *unicode* string
Asked By: Stevanus Iskandar

||

Answers:

Python supports the string type and the unicode type. A string is a sequence of chars while a unicode is a sequence of “pointers”. The unicode is an in-memory representation of the sequence and every symbol on it is not a char but a number (in hex format) intended to select a char in a map. So a unicode var does not have encoding because it does not contain chars.

Answered By: Renjith Nair

Update: Python 3

In Python 3, Unicode strings are the default. The type str is a collection of Unicode code points, and the type bytes is used for representing collections of 8-bit integers (often interpreted as ASCII characters).

Here is the code from the question, updated for Python 3:

>>> my_str = 'A unicode u018e string xf1' # no need for "u" prefix
# the escape sequence "u" denotes a Unicode code point (in hex)
>>> my_str
'A unicode Ǝ string ñ'
# the Unicode code points U+018E and U+00F1 were displayed
# as their corresponding glyphs
>>> my_bytes = my_str.encode('utf-8') # convert to a bytes object
>>> my_bytes
b'A unicode xc6x8e string xc3xb1'
# the "b" prefix means a bytes literal
# the escape sequence "x" denotes a byte using its hex value
# the code points U+018E and U+00F1 were encoded as 2-byte sequences
>>> my_str2 = my_bytes.decode('utf-8') # convert back to str
>>> my_str2 == my_str
True

Working with files:

>>> f = open('foo.txt', 'r') # text mode (Unicode)
>>> # the platform's default encoding (e.g. UTF-8) is used to decode the file
>>> # to set a specific encoding, use open('foo.txt', 'r', encoding="...")
>>> for line in f:
>>>     # here line is a str object

>>> f = open('foo.txt', 'rb') # "b" means binary mode (bytes)
>>> for line in f:
>>>     # here line is a bytes object

Historical answer: Python 2

In Python 2, the str type was a collection of 8-bit characters (like Python 3’s bytes type). The English alphabet can be represented using these 8-bit characters, but symbols such as Ω, и, ±, and ♠ cannot.

Unicode is a standard for working with a wide range of characters. Each symbol has a code point (a number), and these code points can be encoded (converted to a sequence of bytes) using a variety of encodings.

UTF-8 is one such encoding. The low code points are encoded using a single byte, and higher code points are encoded as sequences of bytes.

To allow working with Unicode characters, Python 2 has a unicode type which is a collection of Unicode code points (like Python 3’s str type). The line ustring = u'A unicode u018e string xf1' creates a Unicode string with 20 characters.

When the Python interpreter displays the value of ustring, it escapes two of the characters (Ǝ and ñ) because they are not in the standard printable range.

The line s = unistring.encode('utf-8') encodes the Unicode string using UTF-8. This converts each code point to the appropriate byte or sequence of bytes. The result is a collection of bytes, which is returned as a str. The size of s is 22 bytes, because two of the characters have high code points and are encoded as a sequence of two bytes rather than a single byte.

When the Python interpreter displays the value of s, it escapes four bytes that are not in the printable range (xc6, x8e, xc3, and xb1). The two pairs of bytes are not treated as single characters like before because s is of type str, not unicode.

The line t = unicode(s, 'utf-8') does the opposite of encode(). It reconstructs the original code points by looking at the bytes of s and parsing byte sequences. The result is a Unicode string.

The call to codecs.open() specifies utf-8 as the encoding, which tells Python to interpret the contents of the file (a collection of bytes) as a Unicode string that has been encoded using UTF-8.

Answered By: tom
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.