Python UnicodeDecodeError – Am I misunderstanding encode?

Question:

Any thoughts on why this isn’t working? I really thought ‘ignore’ would do the right thing.

>>> 'add \x93Monitoring\x93 to list '.encode('latin-1','ignore')
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 4: ordinal not in range(128)
Asked By: Greg

Answers:

encode is available on unicode strings, but the string you have there does not seem to be unicode (try with u'add \x93Monitoring\x93 to list ')

>>> u'add \x93Monitoring\x93 to list '.encode('latin-1','ignore')
'add \x93Monitoring\x93 to list '
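
As a rough illustration of where the 'ascii' error in the question comes from: in Python 2, calling .encode on a byte string first decodes it implicitly with the default ASCII codec, and that hidden step is what fails. The sketch below assumes Python 3, where the implicit step no longer exists, so it is spelled out by hand; the names raw, failed_at and round_trip are only illustrative.

```python
# Python 3 sketch; in Python 2 the same bytes would be a plain str literal.
raw = b"add \x93Monitoring\x93 to list "

# Python 2's str.encode() effectively did raw.decode("ascii") first.
try:
    raw.decode("ascii")
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start  # index of the first byte ASCII cannot decode
    print(exc)

# Decoding explicitly with an 8-bit codec avoids the hidden ASCII step.
text = raw.decode("latin-1")
round_trip = text.encode("latin-1", "ignore")
print(failed_at, round_trip == raw)
```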
Answered By: rob

This seems to work:

'add \x93Monitoring\x93 to list '.decode('latin-1').encode('latin-1')

Any issues with that? I wonder when ‘ignore’, ‘replace’ and other such encode error handlers come in?

Answered By: Greg

… There’s a reason they’re called "encodings" …

A little preamble: think of unicode as the norm, or the ideal state. Unicode is just a table of characters. №65 is latin capital A. №937 is greek capital omega. Just that.

In order for a computer to store and/or manipulate Unicode, it has to encode it into bytes. The most straightforward encoding of Unicode is UCS-4; every character occupies 4 bytes, and all ~1,100,000 code points are available. The 4 bytes contain the number of the character in the Unicode tables as a 4-byte integer. Another very useful encoding is UTF-8, which can encode any Unicode character with one to four bytes. But there are also some limited encodings, like "latin1", which cover only a small range of characters, mostly those used in Western European languages. Such encodings use only one byte per character.
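
A small sketch of those byte counts, assuming Python 3 (where str is Unicode and .encode returns bytes). The helper encoded_length and the sample characters are made up for illustration, and "utf-32-be" is used as a stand-in for UCS-4 because it writes no byte-order mark.

```python
# Python 3 sketch: how many bytes one character needs in each encoding.
def encoded_length(char, encoding):
    """Return the encoded size in bytes, or None if the encoding lacks the character."""
    try:
        return len(char.encode(encoding))
    except UnicodeEncodeError:
        return None

# A, Greek capital omega, euro sign, and an emoji outside the BMP
samples = ["A", "\u03a9", "\u20ac", "\U0001f642"]
# utf-32-be stands in for UCS-4 here (4 bytes per character, no BOM)
encodings = ["utf-32-be", "utf-8", "latin-1"]

table = {
    char: {enc: encoded_length(char, enc) for enc in encodings}
    for char in samples
}

for char, sizes in table.items():
    print("U+%04X" % ord(char), sizes)
```

The None entries mark characters that latin-1 simply has no byte for, which is the situation where error handlers such as 'ignore' start to matter.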

Basically, Unicode can be encoded with many encodings, and encoded strings can be decoded to Unicode. The thing is, Unicode came quite late, so all of us that grew up using an 8-bit character set learned too late that all this time we worked with encoded strings. The encoding could be ISO8859-1, or windows CP437, or CP850, or, or, or, depending on our system default.

So when, in your source code, you enter the string "add “Monitoring“ to list" (and I think you wanted the string "add “Monitoring” to list", note the second quote), you actually are using a string already encoded according to your system’s default codepage (by the byte \x93 I assume you use Windows codepage 1252, “Western”). If you want to get Unicode from that, you need to decode the string from the "cp1252" encoding.

So, what you meant to do, was:

"add \x93Monitoring\x94 to list".decode("cp1252", "ignore")

It’s unfortunate that Python 2.x includes an .encode method for strings too; this is a convenience function for "special" encodings, like the "zip" or "rot13" or "base64" ones, which have nothing to do with Unicode.
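
To see what the cp1252 decode above actually produces, here is a sketch assuming Python 3, where the encoded string is written as a bytes literal. It also decodes the same bytes as latin-1 for comparison; the variable names are illustrative only.

```python
# Python 3 sketch: the same bytes decoded with two different codecs.
raw = b"add \x93Monitoring\x94 to list"

# cp1252 maps 0x93 and 0x94 to the curly double quotes U+201C and U+201D.
as_cp1252 = raw.decode("cp1252")

# latin-1 maps every byte to the code point with the same number,
# so 0x93 and 0x94 become the C1 control characters U+0093 and U+0094.
as_latin1 = raw.decode("latin-1")

print(ascii(as_cp1252))
print(ascii(as_latin1))
```

Both decodes succeed, but only the cp1252 one gives the quotation marks the text was meant to contain, which is why picking the right source encoding matters more than the error handler.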

Anyway, all you have to remember for your to-and-fro Unicode conversions is:

  • a Unicode string gets encoded to a Python 2.x string (actually, a sequence of bytes)
  • a Python 2.x string gets decoded to a Unicode string

In both cases, you need to specify the encoding that will be used.
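
A minimal sketch of both directions, assuming Python 3 (str for Unicode, bytes for the encoded form). It also shows where 'ignore' and 'replace' come in: they only matter when the target encoding has no byte for a character. The sample text is made up for illustration.

```python
# Python 3 sketch: encode/decode round trip plus error handlers.
text = "\u03a9 add \u201cMonitoring\u201d"  # omega and curly quotes

# UTF-8 can represent every character, so the round trip is lossless.
utf8_bytes = text.encode("utf-8")
restored = utf8_bytes.decode("utf-8")

# latin-1 has no omega and no curly quotes, so strict encoding fails...
try:
    text.encode("latin-1")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...while the error handlers decide what to do with the misfits.
ignored = text.encode("latin-1", "ignore")    # drops them
replaced = text.encode("latin-1", "replace")  # substitutes '?'

print(restored == text, strict_ok)
print(ignored)
print(replaced)
```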

I’m not very clear, I’m sleepy, but I sure hope I help.

PS A humorous side note: Mayans didn’t have Unicode; ancient Romans, ancient Greeks, ancient Egyptians didn’t either. They all had their own "encodings", and had little to no respect for other cultures. All these civilizations crumbled to dust. Think about it people! Make your apps Unicode-aware, for the good of mankind. 🙂

PS2 Please don’t spoil the previous message by saying "But the Chinese…". If you feel inclined or obligated to do so, though, delay it by thinking that the Unicode BMP is populated mostly by Chinese ideograms, ergo Chinese is the basis of Unicode. I can go on inventing outrageous lies, as long as people develop Unicode-aware applications.

Answered By: tzot

And the magic line is:

unicodedata.normalize('NFKD', text).encode('utf-8', 'ignore')

The one-liner that won’t raise exceptions when it is most needed (remove bad Unicode characters…)
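
For what it is worth, here is a sketch of what that line does to a sample string, assuming Python 3 and that text is already a Unicode string. The sample value is made up. The ASCII variant at the end is not part of the one-liner; it is included only to show a case where 'ignore' visibly drops something.

```python
# Python 3 sketch: NFKD normalization followed by encoding.
import unicodedata

text = "caf\u00e9 \ufb01le"  # e with acute accent, and the "fi" ligature

# NFKD splits the accented e into e + combining accent
# and expands the ligature into the two letters f and i.
normalized = unicodedata.normalize("NFKD", text)

# UTF-8 can encode the result as is, so 'ignore' drops nothing here.
utf8_bytes = normalized.encode("utf-8", "ignore")

# With ASCII as the target, 'ignore' silently drops the combining accent.
ascii_bytes = normalized.encode("ascii", "ignore")

print(ascii(normalized))
print(utf8_bytes)
print(ascii_bytes)
```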

Answered By: rubmz