How to decode escaped Unicode characters?

Question:

I’m trying to replace escaped Unicode characters with the actual characters:

string = "\\u00c3\\u00a4"
print(string.encode().decode("unicode-escape"))

The expected output is ä, the actual output is Ã¤.

Asked By: Toast


Answers:

The following solution seems to work in similar situations (see for example this case about decoding broken Hebrew text):

("\\u00c3\\u00a4"
  .encode('latin-1')
  .decode('unicode_escape')
  .encode('latin-1')
  .decode('utf-8')
)

Outputs:

'ä'

This works as follows:

  • The string, which contains only the ASCII characters '\', 'u', '0', '0', 'c', etc., is converted to bytes using some not-too-crazy 8-bit encoding (it doesn’t really matter which one, as long as it treats ASCII characters properly)
  • Use a decoder that interprets the '\u00c3' escapes as Unicode code point U+00C3 (LATIN CAPITAL LETTER A WITH TILDE, ‘Ã’). From the point of view of your code, it’s nonsense, but this Unicode code point has the right byte representation when encoded again with ISO-8859-1/'latin-1', so…
  • Encode it again with 'latin-1'
  • Decode it "properly" this time, as UTF-8
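
The steps above can be sketched one at a time, with each intermediate value given a name. The variable names are only illustrative, and the input is assumed to be the 12-character ASCII string from the question (literal backslashes, not already-decoded characters).

```python
# Step-by-step version of the chain, assuming the input holds literal backslashes.
escaped = "\\u00c3\\u00a4"                 # 12 ASCII characters: \ u 0 0 c 3 \ u 0 0 a 4

raw = escaped.encode("latin-1")            # same characters, now as bytes
mojibake = raw.decode("unicode_escape")    # 'Ã¤' (U+00C3 followed by U+00A4)
utf8_bytes = mojibake.encode("latin-1")    # b'\xc3\xa4', the UTF-8 encoding of 'ä'
fixed = utf8_bytes.decode("utf-8")         # 'ä'

print(len(escaped), repr(mojibake), utf8_bytes, repr(fixed))
```

The second 'latin-1' step works because Latin-1 maps code points U+0000 to U+00FF one-to-one onto byte values, so it turns the two nonsense characters back into the two bytes they stood for.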

Again, same remark as in the linked post: before investing too much energy trying to repair the broken text, you might want to try to repair the part of the code that is doing the encoding in such a strange way. Not breaking it in the first place is better than breaking it and then repairing it again.
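
As an illustration of what "breaking it in the first place" might look like, here is one hypothetical way such a string could be produced. This is only a guess at the kind of bug involved, not a diagnosis of the asker’s code: UTF-8 bytes are assumed to be misread as Latin-1, and json.dumps stands in for whatever ASCII-only serializer was actually used.

```python
import json

original = "ä"

# Assumed bug: UTF-8 bytes are read back as Latin-1, giving two characters.
misread = original.encode("utf-8").decode("latin-1")

# An ASCII-only serializer then escapes each of those characters separately.
broken = json.dumps(misread)
print(broken)     # "\u00c3\u00a4"

# Decoding with the matching codec upstream yields a single, correct escape.
clean = json.dumps(original.encode("utf-8").decode("utf-8"))
print(clean)      # "\u00e4"
```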

Answered By: Andrey Tyukin

The codecs doc page states that the unicode_escape codec decodes from Latin-1 source code.

That means "unicode-escape" reads its input bytes as latin1, even though the default encoding in Python is utf-8.
So you just need to encode the result back to latin1 and then decode it as utf-8:

import codecs

mixed_string_to_be_unescaped = '\\u002Fq:85\\u002FczM"},{"name":"Santé","parent_name":"Santé'

val = codecs.decode(mixed_string_to_be_unescaped, 'unicode-escape')
val = val.encode('latin1').decode('utf-8')
print(val)

/q:85/czM"},{"name":"Santé","parent_name":"Santé

The solution above works, but it was not clear to me, because I didn’t get why I should convert to latin-1 before applying unicode_escape (I discovered that the codec does this automatically), nor why it was using unicode_escape on an unescaped string.
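
For reuse, the two-step decode could be wrapped in a small function. The name unescape_mixed is made up for this sketch, and it assumes every escape in the input stands for a single byte (U+0000 to U+00FF); an escape above that range makes the latin-1 step raise UnicodeEncodeError.

```python
import codecs

def unescape_mixed(text: str) -> str:
    # Hypothetical helper: resolves \uXXXX escapes and repairs the Latin-1 reading.
    # codecs.decode accepts a str here; the codec reads its UTF-8 bytes as Latin-1.
    step = codecs.decode(text, "unicode-escape")
    # Undo the Latin-1 reading; fails if any escape was above U+00FF.
    return step.encode("latin-1").decode("utf-8")

print(unescape_mixed("\\u002Fq:85\\u002FczM Santé"))   # /q:85/czM Santé
print(unescape_mixed("\\u00c3\\u00a4"))                 # ä

# The Latin-1 behaviour on its own: raw UTF-8 bytes come out as two characters.
print(repr(b"\xc3\xa9".decode("unicode_escape")))       # 'Ã©'
```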

Answered By: Daniele Rugginenti

I’ve spent a good few moments trying to understand this, so I’m sharing it here for potential future readers.

This is one of the promoted questions about decoding escaped Unicode characters, but it is a very special situation. The original string here was created in a strange way, probably after being encoded and decoded several times.
The final output is just one character, which has Unicode code point U+00E4.
If it were stored in the file as '\u00E4', it could be converted using
"\\u00E4".encode('latin-1').decode('unicode_escape')

But here it is the UTF-8 encoding of that code point: 2 bytes (0xC3 0xA4), and these two bytes are represented as a sequence of escaped Unicode characters.
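
The difference between the two cases can be seen side by side in a short sketch. The function name unescape is only illustrative, and both inputs are assumed to contain literal backslashes.

```python
def unescape(text: str) -> str:
    # Resolve \uXXXX escapes in an ASCII-only string.
    return text.encode("latin-1").decode("unicode_escape")

single = unescape("\\u00e4")           # the escape names the code point itself
double = unescape("\\u00c3\\u00a4")    # the escapes name the two UTF-8 bytes

print(repr(single))                    # 'ä'
print(repr(double))                    # 'Ã¤'
print("ä".encode("utf-8"))             # b'\xc3\xa4'

# Only the second case needs the extra latin-1 / utf-8 round trip.
print(repr(double.encode("latin-1").decode("utf-8")))   # 'ä'
```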

Answered By: MkL