Removing literal backslashes from UTF-8 encoded strings in Python
Question:
I have a bunch of strings containing UTF-8 encoded symbols, for example '\\u00f0\\u009f\\u0098\\u0086'. In this case, it represents the emoji 😆, encoded in UTF-8. I want to replace it with the literal emoji. The solution someone recommended to me was to encode it as latin-1 and then decode it as utf-8. So,
'\u00f0\u009f\u0098\u0086'.encode('latin-1').decode('utf-8')
gives me the output
'😆'
Unfortunately, all the strings with those codes have a literal backslash in them, so whenever I do the same operations,
'\\u00f0\\u009f\\u0098\\u0086'.encode('latin-1').decode('utf-8')
I get the following result,
'\\u00f0\\u009f\\u0098\\u0086'
Is there a way to remove those backslashes? If I just replace them with an empty string, all the backslashes disappear.
Answers:
I don’t know where you’re getting that string from, but it’s an… unusual… way of representing the codepoint. U+1F606 SMILING FACE WITH OPEN MOUTH AND TIGHTLY-CLOSED EYES is encoded in UTF-8 as the bytes F0 9F 98 86. In Python string escapes, \uXXXX is used to represent an entire codepoint in the Basic Multilingual Plane, and \UXXXXXXXX codepoints beyond it (like this one), not a single byte of its UTF-8 encoding. So you’d expect to see it represented in a string as '\U0001F606'.
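To make the distinction concrete, here is a small sketch comparing the real codepoint with the four-escape form. It assumes Python 3, and the variable names are only illustrative.

```python
# The emoji written as a single codepoint beyond the BMP.
emoji = '\U0001F606'

# Its UTF-8 encoding is the four bytes F0 9F 98 86.
utf8_bytes = emoji.encode('utf-8')
print(utf8_bytes.hex())  # f09f9886

# Four \u00XX escapes instead produce four separate codepoints,
# one per UTF-8 byte (U+00F0, U+009F, U+0098, U+0086).
mojibake = '\u00f0\u009f\u0098\u0086'
print(len(emoji), len(mojibake))  # 1 4

# latin-1 maps codepoints 0-255 straight to bytes 0-255, so encoding
# with it recovers the original UTF-8 bytes.
assert mojibake.encode('latin-1') == utf8_bytes
```

That byte-for-codepoint mapping is why the latin-1 round trip from the question works once the escapes have been interpreted.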
Anyways, the following will extract the last two hex digits of each escape sequence, turn them into a byte array, and then decode the resulting UTF-8 data into a string:
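import re

# The doubled backslashes make each \u a literal backslash followed by "u".
s = '\\u00f0\\u009f\\u0098\\u0086'
# Capture the last two hex digits of each escape as one byte, then decode the bytes as UTF-8.
utf8_bytes = b''.join(bytes.fromhex(m.group(1)) for m in re.finditer(r'\\u[0-9a-fA-F]{2}([0-9a-fA-F]{2})', s))
print(utf8_bytes.decode())
# Displays 😆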
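If the escapes are embedded in longer text, the same idea might be wrapped in a helper that rewrites only runs of \u00XX escapes and leaves everything else alone. This is a sketch, assuming that every escape in a run stands for one UTF-8 byte; fix_byte_escapes is a hypothetical name, not something from the answer.

```python
import re

# One or more consecutive \u00XX escapes (literal backslash, "u", "00", two hex digits).
_RUN = re.compile(r'(?:\\u00[0-9a-fA-F]{2})+')

def fix_byte_escapes(text):
    r"""Replace runs of literal \u00XX escapes with the UTF-8 text they spell."""
    def decode_run(match):
        # Drop each "\u00" prefix so only the two hex digits per byte remain.
        hex_digits = re.sub(r'\\u00', '', match.group(0))
        try:
            return bytes.fromhex(hex_digits).decode('utf-8')
        except UnicodeDecodeError:
            # The run is not valid UTF-8 after all: leave it untouched.
            return match.group(0)
    return _RUN.sub(decode_run, text)

fixed = fix_byte_escapes('so funny \\u00f0\\u009f\\u0098\\u0086 in c:\\temp')
print(fixed == 'so funny \U0001F606 in c:\\temp')  # True
```

Other backslashes (such as the one in the path) are not touched, which avoids the problem of stripping every backslash from the string.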
b'\\u00f0\\u009f\\u0098\\u0086' can be decoded directly by using the "unicode_escape" encoding.
For example:
>>> b'\\u00f0\\u009f\\u0098\\u0086'.decode("unicode_escape")
'ð\x9f\x98\x86'
Although it looks different, it’s the same:
>>> b'\\u00f0\\u009f\\u0098\\u0086'.decode("unicode_escape") == '\u00f0\u009f\u0098\u0086'
True
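The result above is still four separate characters rather than the emoji. Combining it with the latin-1 round trip from the question would finish the job; a minimal sketch, assuming the input contains nothing but \u00XX-style escapes and plain ASCII:

```python
raw = b'\\u00f0\\u009f\\u0098\\u0086'  # literal backslashes, as in the question

# Step 1: interpret the escapes; yields codepoints U+00F0, U+009F, U+0098, U+0086.
step1 = raw.decode('unicode_escape')

# Step 2: latin-1 turns each codepoint back into the byte with the same value.
step2 = step1.encode('latin-1')

# Step 3: those bytes are the UTF-8 encoding of the emoji.
result = step2.decode('utf-8')
print(result == '\U0001F606')  # True

# Starting from a str instead of bytes, encode it first (ASCII suffices here).
text = '\\u00f0\\u009f\\u0098\\u0086'
same = text.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
print(same == result)  # True
```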
Beware that this also unescapes every other backslash sequence in the data, not just the \uXXXX ones! For example, the following JSON will break:
>>> import json
>>> encoded_json = b'{"a":"Basic realm=\\"Dost\\udceapz"}'
>>> encoded_json.decode("unicode_escape")
'{"a":"Basic realm="Dost\udceapz"}'
>>> json.loads(encoded_json)
{'a': 'Basic realm="Dost\udceapz'}
>>> json.loads(encoded_json.decode("unicode_escape"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/javier/.pyenv/versions/3.6.15/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/home/javier/.pyenv/versions/3.6.15/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/javier/.pyenv/versions/3.6.15/lib/python3.6/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 20 (char 19)
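When the data is JSON, one way to sidestep this might be to let the JSON parser handle the escapes and only then repair the string values. The sketch below assumes the garbling comes from UTF-8 bytes written as \u00XX escapes; repair is a hypothetical helper, and strings that do not survive the round trip are returned unchanged.

```python
import json

def repair(value):
    """Recursively undo UTF-8-read-as-latin-1 garbling in parsed JSON data."""
    if isinstance(value, str):
        try:
            return value.encode('latin-1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            return value  # not garbled (or not repairable): keep as is
    if isinstance(value, list):
        return [repair(item) for item in value]
    if isinstance(value, dict):
        return {repair(key): repair(item) for key, item in value.items()}
    return value

# Escaped quotes, byte-style escapes and a lone surrogate in one document.
encoded_json = b'{"a": "say \\"hi\\" \\u00f0\\u009f\\u0098\\u0086", "b": "Dost\\udceapz"}'
data = repair(json.loads(encoded_json))
print(data['a'] == 'say "hi" \U0001F606')  # True
print(data['b'] == 'Dost\udceapz')         # True: left unchanged
```

Because json.loads resolves \" and \uXXXX itself, the quotes stay balanced and only the string contents go through the latin-1 round trip.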