Python CSV DictReader with UTF-8 data

Question:

AFAIK, the Python (v2.6) csv module can’t handle unicode data by default, correct? In the Python docs there’s an example on how to read from a UTF-8 encoded file. But this example only returns the CSV rows as a list.
I’d like to access the row columns by name as it is done by csv.DictReader but with UTF-8 encoded CSV input file.

Can anyone tell me how to do this in an efficient way? I will have to process CSV files in 100’s of MByte in size.

Asked By: LMatter

||

Answers:

First of all, use the 2.6 version of the documentation. It can change for each release. It says clearly that it doesn’t support Unicode but it does support UTF-8. Technically, these are not the same thing. As the docs say:

The csv module doesn’t directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.

The example below (from the docs) shows how to create two functions that correctly read text as UTF-8 as CSV. You should know that csv.reader() always returns a DictReader object.

import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.DictReader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]
Answered By: kelloti

I came up with an answer myself:

def UnicodeDictReader(utf8_data, **kwargs):
    csv_reader = csv.DictReader(utf8_data, **kwargs)
    for row in csv_reader:
        yield {unicode(key, 'utf-8'):unicode(value, 'utf-8') for key, value in row.iteritems()}

Note: This has been updated so keys are decoded per the suggestion in the comments

Answered By: LMatter

A classed based approach to @LMatter answer, with this approach you still get all the benefits of DictReader such as getting the fieldnames and getting the line number plus it handles UTF-8

import csv

class UnicodeDictReader(csv.DictReader, object):

    def next(self):
        row = super(UnicodeDictReader, self).next()
        return {unicode(key, 'utf-8'): unicode(value, 'utf-8') for key, value in row.iteritems()}
Answered By: Kasa

The csvw package has other functionality as well (for metadata-enriched CSV for the Web), but it defines a UnicodeDictReader class wrapping around its UnicodeReader class, which at its core does exactly that:

class UnicodeReader(Iterator):
    """Read Unicode data from a csv file."""
    […]

    def _next_row(self):
        self.lineno += 1
        return [
            s if isinstance(s, text_type) else s.decode(self._reader_encoding)
            for s in next(self.reader)]

It did catch me off a few times, but csvw.UnicodeDictReader really, really needs to be used in a with block and breaks otherwise. Other than that, the module is nicely generic and compatible with both py2 and py3.

Answered By: Anaphory

The answer doesn’t have the DictWriter methods, so here is the updated class:

class DictUnicodeWriter(object):

    def __init__(self, f, fieldnames, dialect=csv.excel, encoding="utf-8", **kwds):
        self.fieldnames = fieldnames    # list of keys for the dict
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.DictWriter(self.queue, fieldnames, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow({k: v.encode("utf-8") for k, v in row.items()})
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

    def writeheader(self):
        header = dict(zip(self.fieldnames, self.fieldnames))
        self.writerow(header)
Answered By: SePo

For me, the key was not in manipulating the csv DictReader args, but the file opener itself. This did the trick:

with open(filepath, mode="r", encoding="utf-8-sig") as csv_file:
    csv_reader = csv.DictReader(csv_file)

No special class required. Now I can open files either with or without BOM without crashing.

Answered By: shacker

That’s easy with the unicodecsv package.

# pip install unicodecsv
import unicodecsv as csv

with open('your_file.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row)
Answered By: mrvol
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.