Python 3: Converting a UTF-8 Hindi literal to Unicode
Question:
I have a string of UTF-8 literals, '\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2', which converts to ही बोल in Hindi. I am unable to convert the string a to the correct bytes.
a = '\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'
#convert a to bytes
#also tried a = bytes(a,'utf-8')
a = a.encode('utf-8')
s = str(a,'utf-8')
The string is converted to bytes, but the bytes are wrong: every original byte value has been encoded again as a separate character.
RESULT: b'\xc3\xa0\xc2\xa4\xc2\xb9\xc3\xa0\xc2\xa5\xc2\x80 \xc3\xa0\xc2\xa4\xc2\xac\xc3\xa0\xc2\xa5\xc2\x8b\xc3\xa0\xc2\xa4\xc2\xb2'
which prints – हॠबà¥à¤²
EXPECTED: It should be b'\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'
which will be ही बोल
Answers:
Use the raw-unicode-escape codec to encode the string as bytes, then you can decode as UTF-8.
>>> s = '\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'
>>> s.encode('raw-unicode-escape').decode('utf-8')
'ही बोल'
This is something of a workaround; the ideal solution would be to prevent the source of the data stringifying the original bytes.
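To make the "prevent it at the source" point concrete, here is a minimal sketch. It assumes the data originally arrived as UTF-8 bytes and was turned into a str with a one-byte-per-character codec somewhere upstream; the variable names are illustrative only and do not come from the question.

```python
# Assumed raw payload: the UTF-8 bytes of the Hindi text.
raw = b'\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'

# Decoding at the source with the right codec gives the text directly.
text = raw.decode('utf-8')

# Decoding with a one-byte-per-character codec instead yields the
# mis-decoded str from the question (one character per original byte).
mangled = raw.decode('latin-1')

# The workaround simply undoes that wrong step before decoding properly.
repaired = mangled.encode('raw-unicode-escape').decode('utf-8')

print(text)
print(repaired == text)
```

If you control the code that produces the string, fixing the first decode is the cleaner option; the round trip is only needed when the damaged str is all you have.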
Your original string was likely decoded as latin1. Decode it as UTF-8 instead if possible, but if you received it already garbled, you can reverse the damage by encoding as latin1 again and then decoding correctly as UTF-8:
>>> s = '\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'
>>> s.encode('latin1').decode('utf8')
'ही बोल'
Note that the latin1 encoding matches the first 256 Unicode code points, so U+00E0 ('\xe0' in a Python 3 str object) becomes byte E0h (b'\xe0' in a Python 3 bytes object). It’s a 1:1 mapping between U+0000-U+00FF and bytes 00h-FFh.
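The 1:1 mapping can be checked directly. The following is a small sketch, not part of the original answer, that verifies the mapping for all 256 values and shows why encoding the same str as UTF-8 produces the doubled bytes seen in the question.

```python
# Every code point U+0000-U+00FF maps to the single byte of the same value.
for code_point in range(256):
    assert chr(code_point).encode('latin-1') == bytes([code_point])
    assert bytes([code_point]).decode('latin-1') == chr(code_point)

s = '\xe0\xa4\xb9'  # three characters: U+00E0, U+00A4, U+00B9

as_latin1 = s.encode('latin-1')  # one byte per character
as_utf8 = s.encode('utf-8')      # two bytes per character above U+007F

# Only the latin-1 bytes form the valid UTF-8 sequence for the Hindi letter.
letter = as_latin1.decode('utf-8')

# Characters above U+00FF have no latin-1 byte, so encoding them fails.
try:
    letter.encode('latin-1')
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

print(as_latin1, as_utf8, letter)
```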
If you still get an encoding error when printing after using the code above, use the sys module to reconfigure stdout (Python 3.7 or later):
import sys
sys.stdout.reconfigure(encoding='utf8', errors='backslashreplace')
s = '\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'
print(s.encode('raw-unicode-escape').decode('utf-8'))
or, with the string in a variable txt:
print(txt.encode().decode())
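To see what the backslashreplace error handler does without touching stdout, here is a hedged sketch. The helper name printable is hypothetical; it applies the same handler through str.encode, so characters the target encoding cannot represent become escape sequences instead of raising UnicodeEncodeError.

```python
def printable(text: str, encoding: str) -> str:
    """Hypothetical helper: return text as it would survive a round trip
    through `encoding`, escaping anything that encoding cannot represent."""
    return text.encode(encoding, errors='backslashreplace').decode(encoding)

hindi = '\u0939\u0940 \u092c\u094b\u0932'  # the Hindi text from the question

# ASCII cannot represent Devanagari, so each character is escaped.
ascii_safe = printable(hindi, 'ascii')

# UTF-8 can represent everything here, so the text is unchanged.
utf8_safe = printable(hindi, 'utf-8')

print(ascii_safe)
print(utf8_safe == hindi)
```

This is why reconfiguring stdout with errors='backslashreplace' stops the exception: on a UTF-8 console the text prints normally, and on a more limited console you get escapes rather than a crash.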