What is this hexadecimal in the utf16 format?

Question

print(bytes('ba', 'utf-16'))

Result :

b'\xff\xfeb\x00a\x00'

I understand that UTF-16 means every character takes 16 bits, i.e. 00000000 00000000 in binary. I can see 16 bits in \x00a: \x00 = 00000000 and a = 01100001, so together they give \x00a. That part is clear to me, but here is my confusion:

\xff\xfeb

1 – What is this?

2 – Why fe? It should be \x00.

I have read a lot of Wikipedia articles, but it is still not clear.

Asked By: Essence


Answer 1

I think you are misinterpreting the printout.

You have three 16-bit words, each stored low byte first (little-endian):

FF FE: This is the byte order mark, U+FEFF, which Python's utf-16 codec writes at the start of the output (Byte order mark – Wikipedia).
The 8-bit encoding of 'b' (shown as the character 'b' instead of the escape sequence \x62), followed by 00: This is the 16-bit representation of 'b', 0x0062.
The 8-bit encoding of 'a' (shown as the character 'a' instead of \x61), followed by 00: This is the 16-bit representation of 'a', 0x0061.
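To check this breakdown, here is a minimal sketch that splits the bytes from the question into 16-bit words. It assumes the output is little-endian, as in the question, and writes the bytes out literally so the result does not depend on the machine's native byte order.

```python
import struct

# The exact bytes from the question, written as a literal.
data = b'\xff\xfeb\x00a\x00'

# '<' means little-endian, '3H' means three unsigned 16-bit integers.
words = struct.unpack('<3H', data)

print([f'{w:04X}' for w in words])  # ['FEFF', '0062', '0061']
```

The first word is the byte order mark; the other two are the code points of 'b' and 'a'.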

Answered By: Fulvio Corno

Answer 2

You have,

b'\xff\xfeb\x00a\x00'

This is what you asked for. It contains three 16-bit code units:

b'\xff\xfe'  # 0xff 0xfe
b'b\x00'     # 0x62 0x00
b'a\x00'     # 0x61 0x00

The first is U+FEFF (byte order mark), the second is U+0062 (b), and the third is U+0061 (a). The byte order mark is there to distinguish between little-endian UTF-16 and big-endian UTF-16. It is normal to find a BOM at the beginning of a UTF-16 document.

It is just confusing to read because the 'b' and 'a' look like they’re hexadecimal digits, but they’re not.
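One way to avoid that confusion is to print every byte as hex instead of relying on the repr. This is only a small sketch; the separator argument to bytes.hex() needs Python 3.8 or later.

```python
data = b'\xff\xfeb\x00a\x00'

# The letter b and the escape \x62 are the same single byte.
same = b'b' == b'\x62'
print(same)  # True

# bytes.hex() shows every byte as two hex digits, so nothing is
# displayed as an ASCII letter.
hex_view = data.hex(' ')
print(hex_view)   # ff fe 62 00 61 00
print(len(data))  # 6
```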

If you don’t want the BOM, you can use utf-16le or utf-16be.

>>> bytes('ba', 'utf-16le')
b'b\x00a\x00'
>>> bytes('ba', 'utf-16be')
b'\x00b\x00a'

The problem is that you can get some garbage if you decode as the wrong one. If you use UTF-16 with BOM, you’re more likely to get the right result when decoding.
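A short sketch of what "garbage" means here, and of how a BOM avoids it. The variable names are only for illustration.

```python
text = 'ba'
le_bytes = text.encode('utf-16-le')  # b'b\x00a\x00', no BOM

# Decoding with the wrong byte order swaps the two bytes of every
# code unit, so 0x0062 is read as 0x6200 and 0x0061 as 0x6100.
wrong = le_bytes.decode('utf-16-be')
print([hex(ord(c)) for c in wrong])  # ['0x6200', '0x6100']

# With a BOM in front, the plain 'utf-16' codec picks the right
# order and drops the BOM from the decoded text.
with_bom = b'\xff\xfe' + le_bytes
right = with_bom.decode('utf-16')
print(right)  # ba
```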

Answered By: Dietrich Epp

Answer 3

You already got your answer; I just wanted to explain it in my own words for future readers.

In UTF-16, 'a' occupies 16 bits, or 2 bytes, but the value of 'a' itself fits in 8 bits. The question is: should the remaining zero bits go before the value of 'a' or after it? There are two possible ways:

First: 01100001|00000000
Second: 00000000|01100001

If I don’t tell you anything and just hand you these, this would happen:

First = b"0110000100000000"
print(hex(int(First, 2)))   # 0x6100
print(chr(int(First, 2)))   # 愀

Second = b"0000000001100001"
print(hex(int(Second, 2)))  # 0x61
print(chr(int(Second, 2)))  # a

So you can’t say anything just by looking at these bytes. Did I mean to send you 愀 or a ?

First Solution:

I tell you the ordering myself, separately from the data. This is where "big-endian" and "little-endian" come into play:

bytes_ = b"a\x00"  # >>>>>> Please decode it with "Little-Endian"!
print(bytes_.decode("utf-16-le"))  # a - Correct.
print(bytes_.decode("utf-16-be"))  # 愀

So if I tell you the endianness, you can get the correct character.

You see, we were able to achieve this without adding any extra bytes.

Second Solution

I can "embed" the byte ordering into the bytes themselves without telling you explicitly! This is called a BOM (Byte Order Mark).

ordering1 = b"\xfe\xff"
ordering2 = b"\xff\xfe"

print((ordering1 + b"\x00a").decode("utf-16"))  # a
print((ordering2 + b"a\x00").decode("utf-16"))  # a

Now just passing "utf-16" to .decode() is enough. It can figure out the correct byte order on its own. There is no need to specify le or be; that information is already in the data.
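If you want to see roughly what the decoder looks for, the two marks are available as constants in the standard codecs module. The helper below, detect_utf16_order, is a hypothetical name used only for this sketch; it is not part of Python.

```python
import codecs

def detect_utf16_order(data: bytes) -> str:
    """Guess the UTF-16 byte order from a leading BOM (illustrative helper)."""
    if data.startswith(codecs.BOM_UTF16_LE):  # b'\xff\xfe'
        return 'utf-16-le'
    if data.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
        return 'utf-16-be'
    return 'unknown'  # no BOM: the order has to be communicated some other way

print(detect_utf16_order(b'\xff\xfea\x00'))  # utf-16-le
print(detect_utf16_order(b'\xfe\xff\x00a'))  # utf-16-be
print(detect_utf16_order(b'a\x00'))          # unknown
```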

Answered By: S.B
