Removing literal backslashes from utf-8 encoded strings in python

Question

I have a bunch of strings containing UTF-8 encoded symbols, for example '\\u00f0\\u009f\\u0098\\u0086'.
In that case, it represents this emoji, 😆, encoded in UTF-8. I want to replace it with the literal emoji. The solution someone recommended to me was to encode it as latin-1 and then decode it as utf-8. So,

'\u00f0\u009f\u0098\u0086'.encode('latin-1').decode('utf-8')

gives me the output

'😆'

Unfortunately, all the strings with those codes have literal backslashes in them, so whenever I try to do the same operation,

'\\u00f0\\u009f\\u0098\\u0086'.encode('latin-1').decode('utf-8')

I get the following result,

'\\u00f0\\u009f\\u0098\\u0086'

Is there a way to remove those backslashes? Because if I replace them with an empty string, all backslashes disappear.

Asked By: Lucas Jofre

Answer 1

I don’t know where you’re getting that string from, but it’s an... unusual... way of representing the codepoint. U+1F606 SMILING FACE WITH OPEN MOUTH AND TIGHTLY-CLOSED EYES is encoded in UTF-8 as the bytes F0 9F 98 86. In Python string escapes, \uXXXX represents an entire codepoint in the Basic Multilingual Plane, and \UXXXXXXXX represents codepoints beyond it (like this one); neither represents a single byte of a UTF-8 encoding. So you’d expect to see it represented in a string as '\U0001F606'.
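To make the distinction concrete, here is a minimal sketch (the variable names are illustrative and not from the question) comparing the usual one-escape-per-codepoint form with the one-escape-per-byte form seen in the question:

```python
# One escape for the whole codepoint (outside the BMP, so \U with 8 hex digits).
emoji = '\U0001F606'

# Its UTF-8 encoding is the four bytes F0 9F 98 86.
utf8_bytes = emoji.encode('utf-8')
print(utf8_bytes.hex())  # f09f9886

# The form in the question has one \u escape per UTF-8 byte instead,
# which Python reads as four separate characters in the Latin-1 range.
per_byte = '\u00f0\u009f\u0098\u0086'
print(len(emoji), len(per_byte))  # 1 4

# Encoding those four characters as Latin-1 recovers the original UTF-8 bytes.
print(per_byte.encode('latin-1') == utf8_bytes)  # True
```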

Anyways, the following will extract the last two hex digits of each escape sequence, turn them into a byte array, and then decode the resulting UTF-8 data into a string:

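import re
# The input contains literal backslashes, hence the doubled backslashes in this literal.
s = '\\u00f0\\u009f\\u0098\\u0086'
# Capture the last two hex digits of each \uXXXX escape, turn each pair into a byte, join, and decode as UTF-8.
print(b''.join(bytes.fromhex(m.group(1)) for m in re.finditer(r'\\u[0-9a-fA-F]{2}([0-9a-fA-F]{2})', s)).decode())
# Displays 😆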

Answered By: Shawn

Answer 2

b'\\u00f0\\u009f\\u0098\\u0086' can be decoded directly by using the "unicode_escape" encoding.

For example:

>>> b'\\u00f0\\u009f\\u0098\\u0086'.decode("unicode_escape")
'ð\x9f\x98\x86'

Although it looks different, it’s the same string:

>>> b'\\u00f0\\u009f\\u0098\\u0086'.decode("unicode_escape") == '\u00f0\u009f\u0098\u0086'
True
True
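From there, the latin-1/utf-8 round trip from the question yields the emoji. A minimal sketch, assuming the input holds only ASCII text plus these \u00XX escapes (the variable names are illustrative):

```python
raw = '\\u00f0\\u009f\\u0098\\u0086'  # literal backslashes, 24 characters

# Step 1: interpret the escapes; each \u00XX becomes the single character U+00XX.
step1 = raw.encode('ascii').decode('unicode_escape')

# Step 2: map those characters back to bytes and decode the bytes as UTF-8.
emoji = step1.encode('latin-1').decode('utf-8')
print(emoji == '\U0001F606')  # True
```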

Beware that this will also unescape every other backslash escape in the input, such as escaped quotes! For example, the following JSON will break:

>>> import json
>>> encoded_json = b'{"a":"Basic realm=\\"Dost\\udceapz"}'
>>> encoded_json.decode("unicode_escape")
'{"a":"Basic realm="Dost\udceapz"}'
>>> json.loads(encoded_json)
{'a': 'Basic realm="Dost\udceapz'}
>>> json.loads(encoded_json.decode("unicode_escape"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/javier/.pyenv/versions/3.6.15/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/home/javier/.pyenv/versions/3.6.15/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/javier/.pyenv/versions/3.6.15/lib/python3.6/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 20 (char 19)
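One possible way around this is to rewrite only the runs of \u00XX escapes and leave every other backslash alone. The following is a sketch under stated assumptions, not a general solution: the helper name decode_byte_escapes and the pattern BYTE_ESCAPES are made up for illustration, runs that do not form valid UTF-8 are left unchanged, and an escaped backslash directly followed by u00XX would be misread as an escape.

```python
import re

# One or more consecutive \u00XX escapes (literal backslash, 'u', '00', two hex digits).
BYTE_ESCAPES = re.compile(r'(?:\\u00[0-9a-fA-F]{2})+')

def decode_byte_escapes(text):
    """Replace runs of literal \\u00XX escapes with the UTF-8 text they spell out."""
    def replace(match):
        # The last two hex digits of each escape are the byte values.
        pairs = re.findall(r'\\u00([0-9a-fA-F]{2})', match.group(0))
        try:
            return bytes.fromhex(''.join(pairs)).decode('utf-8')
        except UnicodeDecodeError:
            # Not valid UTF-8 (assumption: leave such runs untouched).
            return match.group(0)
    return BYTE_ESCAPES.sub(replace, text)

sample = 'realm=\\"Dost\\" \\u00f0\\u009f\\u0098\\u0086'
print(decode_byte_escapes(sample) == 'realm=\\"Dost\\" \U0001F606')  # True
```

The escaped quotes survive, so a JSON document processed this way can still be passed to json.loads afterwards.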

Answered By: Javier
