Preserving special characters when writing to a CSV – What encoding to use?

Question:

I am trying to save the string "the United Nations’ Sustainable Development Goals (SDGs)" into a CSV file.

If I use utf-8 as the encoding, the apostrophe in the string comes back garbled when I read the file:

import csv
str_ = "the United Nations’ Sustainable Development Goals (SDGs)"

#write to a csv file
with open("output.csv", 'w', newline='', encoding='utf-8') as file:
    csv_writer = csv.writer(file,delimiter=",")
    csv_writer.writerow([str_])

#read from the csv file created above
with open("output.csv",newline='') as file:
    csv_reader = csv.reader(file)

    for row in csv_reader:
        print(row)

The result I get is
['the United Nationsâ€™ Sustainable Development Goals (SDGs)']

If I use cp1252 as the encoding, the apostrophe in the string is preserved, as you can see in the result:

import csv
str_ = "the United Nations’ Sustainable Development Goals (SDGs)"

#write to a csv file
with open("output.csv", 'w', newline='', encoding='cp1252') as file:
    csv_writer = csv.writer(file,delimiter=",")
    csv_writer.writerow([str_])

#read from the csv file created above
with open("output.csv",newline='') as file:
    csv_reader = csv.reader(file)

    for row in csv_reader:
        print(row)

The result I get is
['the United Nations’ Sustainable Development Goals (SDGs)'], which is ideal.

What encoding should I ideally be using if I want to preserve the special characters? Is there a benefit to using utf-8 over cp1252?

My use case is to feed lines from the CSV to a language model (GPT), so I want the text to stay unchanged, readable English.

I am using Python 3.8 on Windows 11

Asked By: newbie101


Answers:

with open("output.csv", 'w', newline='', encoding='utf-8') as file:
    ...

with open("output.csv",newline='') as file:
    ...

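The problem is simply that you’re explicitly, and correctly, writing UTF-8 to the file, but then opening it for reading without specifying an encoding. Python then falls back to the platform’s default encoding, which on your machine is evidently not UTF-8 (on Windows it is typically cp1252). Thus you’re decoding the file with the wrong encoding. The cp1252 version only appears to work because the encoding you write with happens to match that default.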
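The mismatch can be reproduced without any files. The following is a minimal sketch, assuming the implicit default on the asker's machine is cp1252 (typical for Western-locale Windows, but not guaranteed); the variable names are illustrative.

```python
text = "the United Nations’ Sustainable Development Goals (SDGs)"

# Encode as UTF-8, which is what the writing code does explicitly.
raw = text.encode("utf-8")

# Decode as cp1252, which is assumed here to be the implicit default
# used when open() is called without an encoding on this machine.
garbled = raw.decode("cp1252")
print(garbled)  # the curly apostrophe shows up as the three characters â€™

# Decoding with the same encoding that was used for writing recovers the text.
restored = raw.decode("utf-8")
print(restored == text)
```

The curly apostrophe takes three bytes in UTF-8, and cp1252 interprets each of those bytes as a separate character, which is where the three-character sequence comes from.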

Also include the encoding when reading the file, and all is good:

with open('output.csv', newline='', encoding='utf-8') as file:
    ...
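Put together, a round trip with the encoding stated on both sides might look like the sketch below. It writes to a temporary directory rather than to output.csv so that it does not touch existing files; that choice, and the variable names, are assumptions made for illustration.

```python
import csv
import os
import tempfile

text = "the United Nations’ Sustainable Development Goals (SDGs)"

# A temporary directory keeps the sketch from overwriting real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output.csv")

    # Write the row as UTF-8.
    with open(path, "w", newline="", encoding="utf-8") as file:
        csv.writer(file).writerow([text])

    # Read it back, naming the same encoding explicitly.
    with open(path, newline="", encoding="utf-8") as file:
        rows = list(csv.reader(file))

print(rows)
```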

You should use UTF-8 as the encoding, as it can encode every Unicode character. Most other encodings, cp1252 included, can only encode a subset of them. You’d need to have a good reason to use another encoding. If you have a particular target in mind (e.g. Excel) and you know what encoding that target prefers, then use that. Otherwise UTF-8 is a sane default.
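As a rough illustration of the "subset" point, the sketch below checks whether a string can be encoded at all. The helper can_encode and the sample strings are hypothetical and exist only for this example.

```python
def can_encode(text, encoding):
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


curly = "United Nations’ SDGs"       # curly apostrophe, present in cp1252
japanese = "SDGs: 持続可能な開発目標"  # characters that cp1252 does not contain

print(can_encode(curly, "cp1252"))     # True
print(can_encode(curly, "utf-8"))      # True
print(can_encode(japanese, "cp1252"))  # False
print(can_encode(japanese, "utf-8"))   # True
```

So cp1252 happens to cover the string in the question, but writing with it would raise UnicodeEncodeError as soon as a line contains a character outside its repertoire.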

Answered By: deceze