How to decode an ASCII string with backslash x (\x) codes

Question:

I am trying to decode a Brazilian Portuguese text:

‘Demais Subfun\xc3\xa7\xc3\xb5es 12’

It should be

‘Demais Subfunções 12’

>> a.decode('unicode_escape')
>> a.encode('unicode_escape')
>> a.decode('ascii')
>> a.encode('ascii')

all give:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13:
ordinal not in range(128)

on the other hand this gives:

>> print a.encode('utf-8')
Demais Subfun├â┬º├â┬Áes 12

>> print a
Demais Subfunções 12

Answers:

You have binary data that is not ASCII encoded. The \xhh escape sequences indicate that your data is encoded with a different codec. What you are seeing is the representation Python produces with the repr() function, which can be reused as a Python literal that re-creates exactly the same value. This representation is very useful when debugging a program.

In other words, the \xhh escape sequences represent individual bytes, and hh is the hex value of that byte. You have 4 bytes with hex values C3, A7, C3 and B5 that do not map to printable ASCII characters, so Python uses the \xhh notation instead.

What you have is UTF-8 data, so decode it as such:

>>> 'Demais Subfun\xc3\xa7\xc3\xb5es 12'.decode('utf8')
u'Demais Subfun\xe7\xf5es 12'
>>> print 'Demais Subfun\xc3\xa7\xc3\xb5es 12'.decode('utf8')
Demais Subfunções 12

The C3 A7 bytes together encode U+00E7 LATIN SMALL LETTER C WITH CEDILLA, while the C3 B5 bytes encode U+00F5 LATIN SMALL LETTER O WITH TILDE.

ASCII happens to be a subset of the UTF-8 codec, which is why all the other letters can be represented as such in the Python repr() output.
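
The byte-to-character mapping described above can be checked with a short sketch. It assumes Python 3 (so the data is written as a bytes literal), and the variable names are purely illustrative:

```python
import unicodedata

# The raw bytes from the question, written as a Python 3 bytes literal.
raw = b'Demais Subfun\xc3\xa7\xc3\xb5es 12'

# Decode the UTF-8 bytes into a text string.
text = raw.decode('utf-8')

# Collect (code point, official name) for every character outside ASCII.
non_ascii = [
    (f'U+{ord(ch):04X}', unicodedata.name(ch))
    for ch in text
    if ord(ch) > 127
]

print(text)
for code_point, name in non_ascii:
    print(code_point, name)
```

Note that the first non-ASCII byte, 0xC3, sits at index 13 of the byte string, which matches the position reported in the UnicodeDecodeError in the question.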

Answered By: Martijn Pieters

For Python 3: add the b prefix, which makes the literal a bytes object, and then you can call decode:

>>> b"\xe4\xb8\x8b\xe4\xb8\x80\xe6\xad\xa5".decode("utf-8")
'下一步'

Otherwise it raises an error:

>>> "\xe4\xb8\x8b\xe4\xb8\x80\xe6\xad\xa5".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
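
If the value is already a Python 3 str and cannot simply be retyped with a b prefix, one common workaround is to go back to bytes through latin-1 first. The following is a minimal sketch under that assumption, covering two cases that are easy to confuse. The variable names are illustrative, and it only works when every character in the string has a code point below 256:

```python
# Case 1: a str whose characters are really UTF-8 bytes in disguise
# (each \xhh here is one character with that code point).
mangled = 'Demais Subfun\xc3\xa7\xc3\xb5es 12'

# latin-1 maps code points 0-255 one-to-one onto byte values 0-255,
# so this recovers the original bytes, which are then decoded as UTF-8.
fixed = mangled.encode('latin-1').decode('utf-8')

# Case 2: a str containing literal backslashes, e.g. read from a file
# (a raw string keeps the backslash, x, c, 3 ... as separate characters).
literal = r'Demais Subfun\xc3\xa7\xc3\xb5es 12'

# First turn the escape sequences into characters, then repeat case 1.
unescaped = literal.encode('ascii').decode('unicode_escape')
fixed_literal = unescaped.encode('latin-1').decode('utf-8')

print(fixed)
print(fixed_literal)
```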
Answered By: crifan