How to write UTF-8 in a CSV file
Question:
I am trying to create a text file in csv format out of a PyQt4 QTableWidget
. I want to write the text with UTF-8 encoding because it contains special characters. I use the following code:
import codecs
...
myfile = codecs.open(filename, 'w','utf-8')
...
f = result.table.item(i,c).text()
myfile.write(f+";")
It works until a cell contains a special character. I also tried:
myfile = open(filename, 'w')
...
f = unicode(result.table.item(i,c).text(), "utf-8")
But it also stops when a special character appears. I have no idea what I am doing wrong.
Answers:
The examples in the Python documentation show how to write Unicode CSV files: http://docs.python.org/2/library/csv.html#examples
(can’t copy the code here because it’s protected by copyright)
Use this package, it just works: https://github.com/jdunck/python-unicodecsv.
From your shell run:
pip2 install unicodecsv
And (unlike the original question) presuming you’re using Python’s built-in csv
module, turn
import csv
into
import unicodecsv as csv
in your code.
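For reference, on Python 3 the built-in csv module handles unicode natively, so no extra package is needed; you just open the file in text mode with an explicit encoding and newline=''. A minimal sketch, using an in-memory buffer in place of a real file:

```python
import csv
import io

# In Python 3, csv.writer accepts str rows directly; the file object
# takes care of the UTF-8 encoding. newline='' is the documented way
# to open CSV files so the writer controls line endings itself.
buf = io.StringIO()  # stands in for open(filename, 'w', encoding='utf-8', newline='')
writer = csv.writer(buf, delimiter=';')
writer.writerow([u'\u00e9l\u00e8ve', u'\u03bc'])  # 'élève', 'μ'
```

The same writerow call is what fails on Python 2's csv module, which is exactly the gap the unicodecsv backport papers over.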
A very simple hack is to use the json module instead of csv. For example, instead of csv.writer just do the following:
fd = codecs.open(tempfilename, 'wb', 'utf-8')
for c in whatever:
    fd.write(json.dumps(c)[1:-1])  # json.dumps renders ["a", ...]
    fd.write('\n')
fd.close()
Basically, given the list of fields in the correct order, the JSON-formatted string is identical to a CSV line except for the [ and ] at the start and end. And json seems to be robust to UTF-8 in Python 2.x.
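Two caveats worth knowing with this hack: json.dumps defaults to ensure_ascii=True, so non-ASCII characters come out as \uXXXX escapes rather than real UTF-8, and the default ', ' separator leaves a space after each comma that the csv module then treats as part of the next field. Both are fixable with keyword arguments (though the trick still diverges from CSV when a field contains a double quote, which JSON escapes as \" instead of doubling). A small sketch of the fixed-up version:

```python
import csv
import io
import json

row = [u'caf\u00e9', u'\u03bc', u'a,b']

# ensure_ascii=False keeps real UTF-8 characters instead of \uXXXX
# escapes; separators=(',', ':') drops the space json puts after ','.
line = json.dumps(row, ensure_ascii=False, separators=(',', ':'))[1:-1]

# Round-trip check: the stripped JSON line parses back as one CSV row,
# including the quoted field that contains a comma.
parsed = next(csv.reader(io.StringIO(line)))
```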
For me the UnicodeWriter
class from the Python 2 csv module documentation didn’t really work, as it breaks the csv.writer.writerow()
interface.
For example:
csv_writer = csv.writer(csv_file)
row = ['The meaning', 42]
csv_writer.writerow(row)
works, while:
csv_writer = UnicodeWriter(csv_file)
row = ['The meaning', 42]
csv_writer.writerow(row)
will throw AttributeError: 'int' object has no attribute 'encode'
.
As UnicodeWriter
obviously expects all column values to be strings, we can convert the values ourselves and just use the default CSV module:
def to_utf8(lst):
return [unicode(elem).encode('utf-8') for elem in lst]
...
csv_writer.writerow(to_utf8(row))
Or we can even monkey-patch csv_writer to add a write_utf8_row
function – the exercise is left to the reader.
For Python 2 you can use this code before calling csv_writer.writerows(rows).
Note that it will NOT convert integers to UTF-8 strings:
def encode_rows_to_utf8(rows):
    encoded_rows = []
    for row in rows:
        encoded_row = []
        for value in row:
            # Only unicode values need encoding; byte strings and
            # non-string values (ints, floats) pass through unchanged.
            if isinstance(value, unicode):
                value = value.encode("utf-8")
            encoded_row.append(value)
        encoded_rows.append(encoded_row)
    return encoded_rows
I tried using Bojan’s suggestion, but it turned all the None cells into the word None rather than leaving them blank, and rendered floats as 1.231111111111111e+11, among other annoyances. Plus, I want my program to run under both Python 3 and Python 2. So I ended up putting this at the top of the program:
try:
    csv.writer(open(os.devnull, 'w')).writerow([u'\u03bc'])
    PREPROCESS = lambda array: array
except UnicodeEncodeError:
    logging.warning('csv module cannot handle unicode, patching...')
    PREPROCESS = lambda array: [
        item.encode('utf8')
        if hasattr(item, 'encode') else item
        for item in array
    ]
Then I changed all csvout.writerow(row)
statements to csvout.writerow(PREPROCESS(row)).
I could have used the test if sys.version_info < (3,):
instead of the try
statement but that violates "duck typing". I may revisit it and write that first one-liner properly with with
statements, to get rid of the dangling open file and writer
, but then I’d have to use ALL_CAPS variable names or pylint would complain… it should get garbage collected anyway, and in any case only lasts while the script is running.
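A sketch of what that with-statement revisit might look like; wrapping the probe in a function also sidesteps the module-level ALL_CAPS naming complaint (the names here are my own, not from the original):

```python
import csv
import logging
import os


def csv_handles_unicode():
    """Probe whether this interpreter's csv module accepts unicode rows."""
    try:
        with open(os.devnull, 'w') as devnull:  # file is closed on exit
            csv.writer(devnull).writerow([u'\u03bc'])
        return True
    except UnicodeEncodeError:
        return False


if csv_handles_unicode():
    preprocess = lambda row: row
else:
    logging.warning('csv module cannot handle unicode, patching...')
    preprocess = lambda row: [
        item.encode('utf8') if hasattr(item, 'encode') else item
        for item in row
    ]
```

Either way, preprocess leaves non-string values like ints alone, and the open file no longer dangles past the probe.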