Unicode encoding for Polish characters in Python
Question:
I have a Polish artist name, as follows:
Żółte słonie
In my dataset (json file), it has been encoded as:
\u017b\u00f3\u0142te S\u0142onie
I am reading the json and doing some pre-processing and writing the output to a text file. I get the following error:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u017b' in position 0: character maps to <undefined>
I looked up the Unicode encoding for Polish characters online, and the encoding looks fine to me. Since I have never worked with anything other than Latin text before, I wanted to confirm this with the SO community. If the encoding is right, why is Python not handling it?
Thanks,
TM
Answers:
I made a simple test with Python 2.7, and it seems that json changes the type of the object from str to unicode. So you have to encode() such a string before writing it to a text file.
#!/usr/bin/env python
# -*- coding: utf8 -*-
import json

s = 'Żółte słonie'          # Python 2: a byte string (str)
print(type(s))
print(repr(s))

sd = json.dumps(s)          # non-ASCII characters become \uXXXX escapes
print(repr(sd))

s2 = json.loads(sd)         # Python 2: comes back as unicode, not str
print(type(s2))
print(repr(s2))

f = open('out.txt', 'w')
try:
    # Python 2 implicitly encodes unicode as ASCII here, which fails
    f.write(s2)
except UnicodeEncodeError:
    print('UnicodeEncodeError, encoding data...')
    f.write(s2.encode('UTF8'))
    print('data encoded and saved')
f.close()
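For readers on Python 3, the same idea can be sketched as follows. There json.loads returns an ordinary str, and the 'charmap' error typically comes from the file being opened with a legacy default encoding such as cp1252 on Windows, which has no Ż or ł. This is only a minimal sketch under that assumption: the helper name write_text_utf8 and the temporary output path are mine, not part of the question.

```python
import json
import os
import tempfile


def write_text_utf8(path, text):
    """Write text to path, always encoding as UTF-8 regardless of locale."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# JSON text as it might appear in the dataset: non-ASCII characters are
# stored as \uXXXX escapes (the doubled backslashes are Python source syntax).
raw = '"\\u017b\\u00f3\\u0142te s\\u0142onie"'
name = json.loads(raw)  # decoded to a normal str

# A legacy single-byte code page cannot represent these characters,
# which is what the 'charmap' error in the question is reporting.
try:
    name.encode("cp1252")
    legacy_ok = True
except UnicodeEncodeError:
    legacy_ok = False

# Hypothetical output location, used only for this illustration.
out_path = os.path.join(tempfile.gettempdir(), "out.txt")
write_text_utf8(out_path, name)

with open(out_path, encoding="utf-8") as f:
    round_trip = f.read()

print(legacy_ok)            # False: cp1252 has no mapping for the name
print(round_trip == name)   # True: UTF-8 round-trips the name intact
```

So the escapes in the JSON file are fine; what matters is the encoding used when the text is written back out.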
I faced the same problem, and my solution was a simple change to the json.dump call: I set ensure_ascii to False (it is True by default). My method:
import json

def save_to_file(self):
    year, month = self.get_month_year_string()
    filename = "./files/project-" + year + "-" + month + ".json"
    # Open the file as UTF-8 so the unescaped characters can be written
    with open(filename, "w", encoding="utf-8") as file:
        # file.write(self.project)
        json.dump(self.project, file, ensure_ascii=False)
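To see what ensure_ascii changes, here is a small Python 3 sketch using json.dumps on a made-up dictionary (the project value below is a hypothetical stand-in for self.project, which is not shown in the answer). Both outputs are valid JSON and load back to the same data; only the text form differs.

```python
import json

# Hypothetical stand-in for self.project.
project = {"artist": "Żółte słonie"}

escaped = json.dumps(project)                       # ensure_ascii=True (default)
readable = json.dumps(project, ensure_ascii=False)  # characters kept as-is

# With the default, every non-ASCII character becomes a \uXXXX escape,
# so the result is pure ASCII and safe for any output encoding.
escaped_is_ascii = all(ord(ch) < 128 for ch in escaped)

# With ensure_ascii=False the Polish letters appear literally, so the
# file must be opened with an encoding that can hold them (e.g. UTF-8).
readable_is_ascii = all(ord(ch) < 128 for ch in readable)

same_data = json.loads(escaped) == json.loads(readable) == project

print(escaped_is_ascii, readable_is_ascii, same_data)
```

The trade-off is that ensure_ascii=False gives a human-readable file, while the default escaped form never depends on the encoding of the output file.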