Python utf-8 encoding not following unicode rules

Question:

Background: I’ve got a byte file that is encoded using Unicode. However, I can’t figure out the right method to get Python to decode it to a string. Sometimes it uses 1-byte ASCII text. The majority of the time it uses 2-byte "plain latin" text, but it can possibly contain any Unicode character. So my Python program needs to be able to decode that and handle it. Unfortunately byte_string.decode('unicode') isn’t a thing, so I need to specify another encoding scheme. I’m using Python 3.9.

I’ve read through the Python docs on Unicode and on UTF-8. If Python uses Unicode for its strings, and UTF-8 as the default encoding, this should be pretty straightforward, yet I keep getting incorrect decodes.

If I understand how Unicode works, the most significant byte is the character code, and the least significant byte is the lookup value in the decode table. So I would expect 0x00_41 to decode to "A", 0x00_F2 => ò, and 0x65_03_01 => é (e with combining acute accent).

I wrote a short test file to experiment with these byte combinations, and I’m running into a few situations that I don’t understand (despite extensive reading).

Example code:

def main():
    print("Starting MAIN...")

    vrsn_bytes = b'\x76\x72\x73\x6E'
    serato_bytes = b'\x00\x53\x00\x65\x00\x72\x00\x61\x00\x74\x00\x6F'
    special_bytes = b'\xB2\xF2'
    combining_bytes = b'\x41\x75\x64\x65\x03\x01'

    print(f"vrsn_bytes: {vrsn_bytes}")
    print(f"serato_bytes: {serato_bytes}")
    print(f"special_bytes: {special_bytes}")
    print(f"combining_bytes: {combining_bytes}")
    
    encoding_method = 'utf-8'  # also tried latin-1 and cp1252
    vrsn_str = vrsn_bytes.decode(encoding_method)
    serato_str = serato_bytes.decode(encoding_method)
    special_str = special_bytes.decode(encoding_method)
    combining_str = combining_bytes.decode(encoding_method)
    print(f"vrsn_str: {vrsn_str}")
    print(f"serato_str: {serato_str}")
    print(f"special_str: {special_str}")
    print(f"combining_str: {combining_str}")

    return True

if __name__ == '__main__':

    print("Starting Command Line Experiment!")
    
    if not main():
        print("\n Command Line Test FAILED!!")
    else:
        print("\n Command Line Test PASSED!!")

Issue 1: utf-8 encoding. As the experiment is written, I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 0: invalid start byte

I don’t understand why this fails to decode. According to the Unicode table, 0x00B2 should be "SUPERSCRIPT TWO". In fact, it seems like anything above 0x7F raises the same UnicodeDecodeError.

I know that some encoding schemes only support 7 bits, which is what seems like is happening, but utf-8 should support not only 8 bits, but multiple bytes.
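As a small sketch of what the error message is saying: in UTF-8, a byte whose top two bits are 10 (0x80 to 0xBF) is a continuation byte and can never start a character, so a lone 0xB2 is rejected. The variable names here are only illustrative.

```python
# U+00B2 (SUPERSCRIPT TWO) is a code point, not a UTF-8 byte sequence.
encoded = "\u00b2".encode("utf-8")
print(encoded)  # b'\xc2\xb2'

# The raw byte 0xB2 has the bit pattern 10110010: a continuation byte,
# which UTF-8 does not allow at the start of a character.
try:
    b"\xb2".decode("utf-8")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc.reason)  # invalid start byte
```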

If I changed encoding_method to encoding_method = 'latin-1' which extends the original ascii 128 characters to 256 characters (up to 0xFF), then I get a better output:

vrsn_str: vrsn
serato_str: Serato
special_str: ²ò
combining_str: Aude

However, this encoding is not handling the 2-byte codes properly. 0x00_53 should be S, not �S, and none of the encoding methods I’ll mention in this post handle the combining acute accent after Aude properly.
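Printing the repr() of the decoded strings makes the hidden characters visible. This is only a sketch using the same byte strings as the experiment; the names ending in _latin1 are made up for illustration.

```python
serato_bytes = b"\x00\x53\x00\x65\x00\x72\x00\x61\x00\x74\x00\x6F"
combining_bytes = b"\x41\x75\x64\x65\x03\x01"

# latin-1 maps every byte 0x00-0xFF to the code point with the same value,
# so it never fails, but it also never joins two bytes into one character.
serato_latin1 = serato_bytes.decode("latin-1")
combining_latin1 = combining_bytes.decode("latin-1")

print(repr(serato_latin1))     # '\x00S\x00e\x00r\x00a\x00t\x00o'
print(repr(combining_latin1))  # 'Aude\x03\x01'
print(len(serato_latin1))      # 12
```

Each 0x00 byte becomes its own NUL character, and 0x03 0x01 become two control characters instead of U+0301.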

So far I’ve tried many different encoding methods, but the ones that are closest are: unicode_escape, latin-1, and cp1252. While I expect utf-8 to be what I’m supposed to use, it does not behave like it’s described in the Python doc linked above.

Any help is appreciated. Besides trying more methods, I don’t understand why this isn’t decoding according to the table in link 3.

UPDATE:

After some more reading, and seeing your responses, I understand why you’re so confused. I’m going to explain further so that hopefully this helps someone in the future.

The byte file that I’m decoding is not mine (hence why the encoding does not make sense). What I see now is that the bytes represent the code point, not the byte representation of the unicode character.

For example: I want 0x00_F2 to translate to ò. But the actual UTF-8 byte representation of ò is 0xC3_B2. What I have is the integer representation of the code point. So while I was trying to decode, what I actually need to do is convert 0x00F2 to an integer (242); then I can use chr(242) to get the Unicode character.
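A minimal sketch of that idea, assuming every character is stored as a fixed 2-byte big-endian code point. That assumption does not hold for the mixed-width combining_bytes example, and decode_code_points is a hypothetical helper, not something from the file format.

```python
def decode_code_points(data: bytes, width: int = 2) -> str:
    """Treat each `width`-byte big-endian group as one Unicode code point."""
    if len(data) % width:
        raise ValueError("data length is not a multiple of width")
    return "".join(
        chr(int.from_bytes(data[i:i + width], "big"))
        for i in range(0, len(data), width)
    )


serato_bytes = b"\x00\x53\x00\x65\x00\x72\x00\x61\x00\x74\x00\x6F"
print(decode_code_points(serato_bytes))  # Serato
print(decode_code_points(b"\x00\xf2"))   # ò
print(decode_code_points(b"\x00\xb2"))   # ²

# For 2-byte values outside the surrogate range, Python's built-in
# utf-16-be codec gives the same result.
print(serato_bytes.decode("utf-16-be"))  # Serato
```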

I don’t know why the file was written this way, and I can’t change it. But I see now why the decoding wasn’t working. Hopefully this helps someone avoid the frustration I went through figuring it out.

Thanks!

Asked By: Chaky31


Answers:

This isn’t actually a python issue, it’s how you’re encoding the character. To convert a unicode codepoint to utf-8, you do not simply get the bytes from the codepoint position.

For example, the code point U+2192 is →. The actual binary representation in utf-8 is: 0xE28692, or 11100010 10000110 10010010

As we can see, this is 3 bytes, not 2 as we’d expect if we only used the position. To get correct behavior, you can either do the encoding by hand, or use a converter such as this one:

https://onlineunicodetools.com/convert-unicode-to-binary

This will let you input a unicode character and get the utf-8 binary representation.
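The same lookup can also be done in Python itself with str.encode. A short sketch, with illustrative variable names:

```python
arrow = "\u2192"  # RIGHTWARDS ARROW
utf8_bytes = arrow.encode("utf-8")
print(utf8_bytes.hex())  # e28692

# Show each byte as 8 binary digits.
bits = " ".join(f"{byte:08b}" for byte in utf8_bytes)
print(bits)  # 11100010 10000110 10010010
```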

To get correct output for ò, we need to use 0xC3B2.

>>> s = b'\xC3\xB2'
>>> print(s.decode('utf-8'))
ò

The reason why you can’t use the direct binary representation is because of the header bits in each byte. In utf-8, a codepoint is encoded as 1, 2, 3, or 4 bytes. For example, to signify a 1-byte codepoint, the first bit is encoded as a 0. This means that we can only store 2^7 1-byte code points. So, the codepoint U+0080, which is a control character, must be encoded as a 2-byte sequence: 11000010 10000000.
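To see the length grow with the code point value, here is a quick sketch that encodes a few sample code points (chosen only for illustration) and counts the resulting bytes:

```python
samples = {
    "U+0041": 0x41,      # fits in 7 bits
    "U+0080": 0x80,      # first code point that needs two bytes
    "U+2192": 0x2192,    # the arrow from the example above
    "U+1F600": 0x1F600,  # outside the Basic Multilingual Plane
}

lengths = {label: len(chr(cp).encode("utf-8")) for label, cp in samples.items()}
for label, n in lengths.items():
    print(label, n)
# U+0041 1
# U+0080 2
# U+2192 3
# U+1F600 4
```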

For this character, the first byte begins with the header 110, while the second byte begins with the header 10. This means that the data for the codepoint is stored in the last 5 bits of the first byte and the last 6 bits of the second byte. If we combine those, we get 00010 000000, which is equivalent to 0x80.
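The bit arithmetic above can be checked by hand in Python. This is only a sketch for the 2-byte case, using bit masks to strip the headers:

```python
first, second = 0b11000010, 0b10000000  # the bytes 0xC2 0x80

# Check the headers: 110xxxxx for the lead byte, 10xxxxxx for the continuation.
assert first >> 5 == 0b110
assert second >> 6 == 0b10

# Keep the low 5 bits of the first byte and the low 6 bits of the second,
# then join them into one 11-bit value.
code_point = ((first & 0b00011111) << 6) | (second & 0b00111111)
print(hex(code_point))  # 0x80

# Python's own decoder agrees.
decoded = bytes([first, second]).decode("utf-8")
print(decoded == chr(code_point))  # True
```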

Answered By: labmonkey398