Unicode encoding for Polish characters in Python

Question:

I have a Polish artist name, as follows:

Żółte słonie

In my dataset (json file), it has been encoded as:

\u017b\u00f3\u0142te S\u0142onie

I am reading the json and doing some pre-processing and writing the output to a text file. I get the following error:

UnicodeEncodeError: 'charmap' codec can't encode character u'\u017b' in position 0: character maps to <undefined>

I looked up the Unicode code points for Polish characters online, and the encoding looks fine to me. Since I have never worked with anything other than Latin characters before, I wanted to confirm this with the SO community. If the encoding is right, why is Python not handling it?

Thanks,
TM

Asked By: visakh


Answers:

I ran a simple test with Python 2.7, and it seems that json changes the type of the object from str to unicode. So you have to encode() such a string before writing it to a text file.

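#!/usr/bin/env python
# -*- coding: utf8 -*-
# Python 2.7: json.loads() returns unicode, which must be encoded before writing.

import json

s = 'Żółte słonie'  # byte string (str) in Python 2
print(type(s))
print(repr(s))
sd = json.dumps(s)  # JSON text with \uXXXX escapes
print(repr(sd))
s2 = json.loads(sd)  # decoded back as unicode, not str
print(type(s2))
print(repr(s2))

with open('out.txt', 'w') as f:
    try:
        f.write(s2)  # implicit encoding fails on non-ASCII characters
    except UnicodeEncodeError:
        print('UnicodeEncodeError, encoding data...')
        f.write(s2.encode('UTF8'))  # write explicit UTF-8 bytes instead
        print('data encoded and saved')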
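
For readers on Python 3, the same idea might look like the sketch below. It is not part of the test above. It assumes the 'charmap' error comes from a Windows default encoding such as cp1252 (a guess based on the error message), and the `raw` sample and the `artist` key are hypothetical stand-ins for the real dataset.

```python
import json
import os
import tempfile

# Hypothetical sample mirroring the question's data: JSON text with \uXXXX escapes.
raw = '{"artist": "\\u017b\\u00f3\\u0142te S\\u0142onie"}'
name = json.loads(raw)["artist"]  # escapes are decoded into real characters

# Assumption: the platform default is cp1252, which Python reports as 'charmap'.
# That code page has no slot for U+017B, so encoding fails.
try:
    name.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Passing an explicit encoding avoids relying on the platform default.
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(name)

# Read the file back with the same encoding to confirm the round trip.
with open(out_path, encoding="utf-8") as f:
    round_trip = f.read()
```

In this sketch the data itself is fine; only the encoding chosen for the output file decides whether the write succeeds.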
Answered By: Michał Niklas

I faced the same problem, and my solution was a simple change to the json.dump call: I set ensure_ascii to False (it is True by default). My method:

def save_to_file(self):
    year, month = self.get_month_year_string()
    filename = "./files/project-" + year + "-" + month + ".json"
    with open(filename, "w", encoding="utf-8") as file:
        # file.write(self.project)
        json.dump(self.project, file, ensure_ascii=False)
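
As a small illustration of what ensure_ascii changes, the following standalone sketch (Python 3, using json.dumps instead of json.dump, with a sample string taken from the question) compares the two settings.

```python
import json

name = "Żółte słonie"

# Default (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(name)

# ensure_ascii=False: characters are kept as-is, so the output file
# must be opened with an encoding that can represent them (e.g. UTF-8).
literal = json.dumps(name, ensure_ascii=False)

# Both forms decode to the same string; only the serialized text differs.
same = json.loads(escaped) == json.loads(literal) == name
```

Either form is valid JSON. The escaped form is plain ASCII and can be written with any codec, while the literal form is easier to read but depends on the file encoding.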
Answered By: Mariusz K.