How to convert a string from CP-1251 to UTF-8?

Question:

I’m using mutagen to convert ID3 tag data from CP-1251/CP-1252 to UTF-8. On Linux this works without problems, but on Windows, calling SetValue() on a wx.TextCtrl produces the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

The original string (assumed to be CP-1251 encoded) that I’m pulling from mutagen is:

u'\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3'

I’ve tried converting this to UTF-8:

dd = d.decode('utf-8')

…and even changing the default encoding from ASCII to UTF-8:

sys.setdefaultencoding('utf-8')

…But I get the same error.

Asked By: jsnjack


Answers:

If you know for sure that your input is cp1251-encoded, you can do:

d.decode('cp1251').encode('utf8')
Answered By: Johannes Charra

If d is a correct Unicode string, then d.encode('utf-8') yields a UTF-8-encoded bytestring. Don’t test it by printing, though: the output may simply fail to display properly because of codepage shenanigans.
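
As a minimal sketch of checking the result without relying on the console, the bytes can be inspected with repr() and round-tripped. This is written for Python 3 (where str is Unicode text and bytes is the encoded form), and the sample value is only an illustration, not data from the question:

```python
# Python 3 sketch: verify an encoding by looking at the bytes, not the glyphs.
# The sample text is a hypothetical value ("Белая", written with escapes).
text = "\u0411\u0435\u043b\u0430\u044f"

encoded = text.encode("utf-8")

# repr() shows the raw bytes regardless of the console code page.
print(repr(encoded))  # b'\xd0\x91\xd0\xb5\xd0\xbb\xd0\xb0\xd1\x8f'

# Decoding the bytes again must give back the original text.
assert encoded.decode("utf-8") == text
```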

Answered By: Cat Plus Plus

Your string d is a Unicode string, not a UTF-8-encoded string! So you can’t decode() it; you must encode() it to UTF-8 or whatever encoding you need.

>>> d = u'\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3'
>>> d
u'\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3'
>>> print d
Áåëàÿ ÿáëûíÿ ãðîìó
>>> d.encode("utf-8")
'\xc3\x81\xc3\xa5\xc3\xab\xc3\xa0\xc3\xbf \xc3\xbf\xc3\xa1\xc3\xab\xc3\xbb\xc3\xad\xc3\xbf \xc3\xa3\xc3\xb0\xc3\xae\xc3\xac\xc3\xb3'

(which is something you’d do at the very end of all processing when you need to save it as a UTF-8 encoded file, for example).

If your input is in a different encoding, it’s the other way around:

>>> d = "Schoßhündchen"                 # native encoding: cp850
>>> d = "Schoßhündchen".decode("cp850") # decode from Windows codepage
>>> d                                   # into a Unicode string (now work with this!)
u'Scho\xdfh\xfcndchen'
>>> print d                             # it displays correctly if your shell knows the glyphs
Schoßhündchen
>>> d.encode("utf-8")                   # before output, convert to UTF-8
'Scho\xc3\x9fh\xc3\xbcndchen'
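
The same "decode on input, encode on output" pattern can be sketched for whole files. This is only an illustration: convert_file, its parameters, and the cp1251 default are hypothetical names and assumptions, not part of mutagen or wxPython. It is written for Python 3, using io.open, which also exists in Python 2.6+.

```python
import io


def convert_file(src_path, dst_path, src_encoding="cp1251"):
    """Re-encode a text file to UTF-8 (hypothetical helper)."""
    # Decode on input: io.open turns the file's bytes into Unicode text.
    with io.open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    # Encode on output: the text is written back as UTF-8 bytes.
    with io.open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
    # Return the decoded text so the caller can keep working with it.
    return text
```

All processing between the two steps happens on the decoded text; the source encoding still has to be known or guessed correctly, since the bytes themselves do not record it.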
Answered By: Tim Pietzcker

I provided some relevant info on encoding/decoding text in this response: https://stackoverflow.com/a/34662963/2957811

To add to that here, it’s important to think of text as being in one of two possible states: ‘encoded’ and ‘decoded’.

‘decoded’ means it is in an internal representation by your interpreter/libraries that can be used for character manipulation (e.g. searches, case conversion, substring slicing, character counts, …) or display (looking up a code point in a font and drawing the glyph), but cannot be passed in or out of the running process.

‘encoded’ means it is a byte stream that can be passed around as can any other data, but is not useful for manipulation or display.

If you’ve worked with serialized objects before, consider ‘decoded’ to be the useful object in memory and ‘encoded’ to be the serialized version.

'\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3' is your encoded (or serialized) version, presumably encoded with cp1251. This encoding needs to be right because that’s the ‘language’ used to serialize the characters and is needed to recreate the characters in memory.

You need to decode this from its current encoding (cp1251) into Python Unicode characters, then re-encode it as a UTF-8 byte stream. The answerer who suggested d.decode('cp1251').encode('utf8') had this right; I am just hoping to help explain why that should work.
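
The two steps can be sketched as follows. This assumes Python 3, where the encoded form is a bytes object and the decoded form is a str, and it assumes (as above) that the bytes really are cp1251:

```python
# Python 3 sketch: encoded bytes -> decoded text -> re-encoded bytes.
# The byte values are the ones from the question, assumed to be cp1251.
raw = b"\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3"

# Step 1: decode with the encoding the bytes were written in.
text = raw.decode("cp1251")

# Step 2: re-encode the text in the encoding you want to output.
utf8_bytes = text.encode("utf-8")

# Each Cyrillic letter takes one byte in cp1251 but two in UTF-8.
print(len(raw), len(utf8_bytes))  # 18 34
```

If the first step used the wrong codec (for example latin-1), it would still succeed but produce the wrong characters, which is why the source encoding has to be right.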

Answered By: user2957811

I lost half a day finding the correct answer. If you get a Unicode string from an external source that was windows-1251 encoded (a web site, in my case), you will see something like this in the Linux console:

u'\u043a\u043e\u043c\u043d\u0430\u0442\u043d\u0430\u044f \u043a\u0432\u0430\u0440\u0442\u0438\u0440\u0430…..'

This is not a correct Unicode presentation of your data. So Tim Pietzcker is right: you should encode() it first, then decode() it, and then encode() it again to the correct encoding.

In my case this strange string was stored in the “text” variable, and the line:

print text.encode("cp1251").decode('cp1251').encode('utf8')   

gave me:

“Своя 2-х комнатная квартира с отличным ремонтом….”

Yes, it makes me crazy too. But it works!

P.S. When saving to a file, do it the same way:

some_file.write(text.encode("cp1251").decode('cp1251').encode('utf8'))
Answered By: Александр Степаненко

I’d rather add a comment to Александр Степаненко’s answer, but my reputation doesn’t allow it yet. I had a similar problem converting MP3 tags from CP-1251 to UTF-8, and the encode/decode/encode solution worked for me, except that I had to replace the first encoding with “latin-1”, which essentially converts the Unicode string into a byte sequence without any real encoding:

print text.encode("latin-1").decode('cp1251').encode('utf8')

and when saving back with, for example, mutagen, it doesn’t need to be encoded:

audio["title"] = title.encode("latin-1").decode('cp1251')
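
A minimal Python 3 sketch of the same latin-1 round trip, wrapped in a helper so that text which is already correct is left alone. The function name fix_cp1251_mojibake is hypothetical, and the approach assumes the string holds cp1251 bytes that were mistakenly read as latin-1:

```python
def fix_cp1251_mojibake(text):
    """Recover cp1251 text that was mis-decoded as latin-1 (hypothetical helper)."""
    try:
        # latin-1 maps code points 0-255 straight back to the original bytes;
        # those bytes are then decoded with the encoding they were written in.
        return text.encode("latin-1").decode("cp1251")
    except UnicodeError:
        # Characters outside latin-1 (e.g. real Cyrillic) or bytes undefined
        # in cp1251 mean the assumption does not hold; return text unchanged.
        return text


garbled = "\xc1\xe5\xeb\xe0\xff"          # displays as "Áåëàÿ"
fixed = fix_cp1251_mojibake(garbled)      # "Белая"
print(fixed == "\u0411\u0435\u043b\u0430\u044f")  # True
```

One caveat with this sketch: genuine Western European text such as "ß" also fits in latin-1, so it would be converted to Cyrillic as well. The helper is only safe when you already know the tags were written as cp1251.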
Answered By: Andrey