What is this hexadecimal in the utf16 format?
Question:
print(bytes('ba', 'utf-16'))
Result :
b'\xff\xfeb\x00a\x00'
I understand that UTF-16 means every character takes 16 bits (00000000 00000000 in binary), and I can see 16 bits here: \x00a
means \x00 = 00000000
and a = 01100001
so together they give \x00a.
That part is clear to me, but here is my confusion:
\xff\xfeb
1 – What is this?
2 – Why \xfe? It should be \x00.
I have read a lot of Wikipedia articles, but it is still not clear.
Answers:
I think you are misinterpreting the printout.
You have three 16-bit words:
FF FE: This is the byte-order mark, which Python's utf-16 codec writes at the start of the output (Byte order mark – Wikipedia).
62 00: The 8-bit encoding of 'b' (shown as the character 'b' instead of as a \x escape sequence), followed by 00. Read low byte first, this is 0x0062, the 16-bit representation of 'b'.
61 00: The 8-bit encoding of 'a', followed by 00. Read low byte first, this is 0x0061, the 16-bit representation of 'a'.
You have:
b'\xff\xfeb\x00a\x00'
This is what you asked for; it has three characters:
b'\xff\xfe' # 0xff 0xfe
b'b\x00' # 0x62 0x00
b'a\x00' # 0x61 0x00
The first is U+FEFF (byte order mark), the second is U+0062 (b), and the third is U+0061 (a). The byte order mark is there to distinguish between little-endian UTF-16 and big-endian UTF-16. It is normal to find a BOM at the beginning of a UTF-16 document.
It is just confusing to read because the 'b' and 'a' look like they're hexadecimal digits, but they're not.
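One way to sidestep that ambiguity is to print the bytes as hex, so that every byte shows up as two hex digits. This is only a minimal sketch, assuming Python 3.8+ (for the separator argument of bytes.hex()). The variable names are illustrative, and the byte order in the comments assumes a little-endian machine, since the plain utf-16 codec encodes in the platform's native order.

```python
data = 'ba'.encode('utf-16')  # same bytes as bytes('ba', 'utf-16')

# Hex view: no byte is displayed as a letter, so nothing can be misread.
print(data.hex(' '))  # ff fe 62 00 61 00 on a little-endian machine

# Split the data into 16-bit code units (2 bytes each).
units = [data[i:i + 2] for i in range(0, len(data), 2)]
for unit in units:
    print(unit.hex(' '))  # BOM first, then one unit per character
```

The first unit is the BOM, and the other two are 'b' and 'a'.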
If you don't want the BOM, you can use utf-16le or utf-16be.
>>> bytes('ba', 'utf-16le')
b'b\x00a\x00'
>>> bytes('ba', 'utf-16be')
b'\x00b\x00a'
The problem is that you can get some garbage if you decode as the wrong one. If you use UTF-16 with BOM, you’re more likely to get the right result when decoding.
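To make the "garbage" concrete, here is a small sketch (variable names are only illustrative) that encodes without a BOM, decodes with the wrong byte order, and then compares that with a BOM-prefixed round trip:

```python
text = 'ba'

# Encode as little-endian without a BOM, then decode as big-endian by mistake.
le_bytes = text.encode('utf-16-le')    # b'b\x00a\x00'
wrong = le_bytes.decode('utf-16-be')   # the two bytes of each unit are swapped
print([hex(ord(c)) for c in wrong])    # ['0x6200', '0x6100'], two CJK characters

# With a BOM, the decoder picks the byte order itself.
with_bom = text.encode('utf-16')
print(with_bom.decode('utf-16'))       # ba
```

Nothing raises an error in the wrong case; you just get different characters, which is why it is easy to miss.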
You already got your answer; I just wanted to explain it in my own words for future readers.
In UTF-16 encoding, 'a' should occupy 16 bits, or 2 bytes. The 'a' itself needs only 8 bits. The question is: should I put the remaining zeroes before the value of 'a' or after it? There are two possible ways:
First: 01100001|00000000
Second: 00000000|01100001
If I don’t tell you anything and just hand you these, this would happen:
First = b"0110000100000000"
print(hex(int(First, 2))) # 0x6100
print(chr(int(First, 2))) # 愀
Second = b"0000000001100001"
print(hex(int(Second, 2))) # 0x61
print(chr(int(Second, 2))) # a
So you can't say anything just by looking at these bytes. Did I mean to send you 愀 or a?
First Solution:
I tell you the ordering explicitly. This is where "big-endian" and "little-endian" come into play:
bytes_ = b"a\x00" # >>>>>> Please decode it with "Little-Endian"!
print(bytes_.decode("utf-16-le")) # a - Correct.
print(bytes_.decode("utf-16-be")) # 愀
So if I tell you the endianness, you can get the correct character.
You see, we were able to achieve this without any extra character.
Second Solution:
I can "embed" the byte ordering into the bytes themselves without explicitly telling you! This is called the BOM (Byte Order Mark).
ordering1 = b"\xfe\xff"
ordering2 = b"\xff\xfe"
print((ordering1 + b"\x00a").decode("utf-16")) # a
print((ordering2 + b"a\x00").decode("utf-16")) # a
Now just passing "utf-16" to .decode() is enough. It can figure out the byte order by itself. There is no need to specify le or be; the information is already there.
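The standard codecs module exposes the two BOM values as constants, so you don't have to type the escapes yourself. Below is a rough sketch of how the check could look. Note that detect_utf16_order is a hypothetical helper written only for illustration, not something the codec provides:

```python
import codecs

def detect_utf16_order(data: bytes) -> str:
    """Hypothetical helper: report the byte order announced by a leading BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):   # b'\xff\xfe'
        return "little-endian"
    if data.startswith(codecs.BOM_UTF16_BE):   # b'\xfe\xff'
        return "big-endian"
    return "unknown (no BOM)"

le_data = codecs.BOM_UTF16_LE + "a".encode("utf-16-le")
be_data = codecs.BOM_UTF16_BE + "a".encode("utf-16-be")

print(detect_utf16_order(le_data), le_data.decode("utf-16"))  # little-endian a
print(detect_utf16_order(be_data), be_data.decode("utf-16"))  # big-endian a
print(detect_utf16_order(b"a\x00"))                           # unknown (no BOM)
```

Both byte strings decode to the same 'a' with plain "utf-16", because the BOM tells the decoder which order to use.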