UTF-8 characters in python string even after decoding from UTF-8?
Question:
I’m working on converting portions of XHTML to JSON objects. I finally got everything in JSON form, but some UTF-8 character codes are being printed.
Example:
{
"p": {
"@class": "para-p",
"#text": "I\u2019m not on Earth."
}
}
This should be:
{
"p": {
"@class": "para-p",
"#text": "I'm not on Earth."
}
}
This is just one example of UTF-8 codes coming through. How can I go through the string and replace every instance of a UTF-8 code with the character it represents?
Answers:
\u2019 is not a UTF-8 character, but a Unicode escape code. It’s valid JSON, and when read back via json.load it will become ’ (RIGHT SINGLE QUOTATION MARK).
If you want to write the actual character, use ensure_ascii=False to prevent escape codes from being written for non-ASCII characters:
import json
with open('output.json', 'w', encoding='utf8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
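To make the effect of ensure_ascii concrete, here is a minimal sketch. It assumes the JSON text already exists with the escape in it (the variable names escaped_text, default_out and literal_out are illustrative, not from the question) and shows that loading it and dumping it again with ensure_ascii=False writes the real character:

```python
import json

# JSON text as it might already exist, with the quote written as an escape.
# The raw string keeps the backslash so the JSON parser sees \u2019.
escaped_text = r'{"p": {"@class": "para-p", "#text": "I\u2019m not on Earth."}}'

data = json.loads(escaped_text)  # the escape is decoded into the real character

default_out = json.dumps(data)                      # ensure_ascii=True (the default)
literal_out = json.dumps(data, ensure_ascii=False)  # non-ASCII characters written as-is

print(default_out)  # still contains the \u2019 escape
print(literal_out)  # contains the actual ’ character
```

Both outputs are valid JSON and decode to the same Python object; the only difference is how the non-ASCII character is spelled in the text.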
You didn’t paste your code, so I don’t know how you converted the XHTML to JSON. I assume that you ended up with hex-escaped characters in your Python objects. \u2019 stands for a single character, written as a 16-bit hex value. The json module handles this by default. For example, the json.loads method decodes it:
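import json

# Raw string, so the \u2019 escape reaches the JSON parser unchanged.
x = r'''{
"p": {
"@class": "para-p",
"#text": "I\u2019m not on Earth."
}
}'''
print(x)  # the JSON text, escape still visible
x_json = json.loads(x)  # the parser turns \u2019 into the real character
print(x_json)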
Output shows:
{
"p": {
"@class": "para-p",
"#text": "I\u2019m not on Earth."
}
}
{'p': {'@class': 'para-p', '#text': 'I’m not on Earth.'}}
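The distinction between a Unicode escape and UTF-8 can be checked directly. The following is a small sketch using only the standard library (the variable name ch is illustrative): the escape names one character by its code point, while UTF-8 is the byte encoding used when that character is written to a file.

```python
import unicodedata

ch = "\u2019"  # one character, written here as a Python escape

print(len(ch))               # 1
print(hex(ord(ch)))          # 0x2019, the code point
print(unicodedata.name(ch))  # RIGHT SINGLE QUOTATION MARK
print(ch.encode("utf-8"))    # b'\xe2\x80\x99', the three UTF-8 bytes
```

So u2019 in the JSON text is the code point spelled out in hex, not the UTF-8 bytes of the character.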