Most Pythonic way to read CSV values into dict of lists

Question:

I have a CSV file with headers at the top of columns of data as:

a,b,c
1,2,3
4,5,6
7,8,9

and I need to read it in a dict of lists:

desired_result = {'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]}

When reading this with DictReader I am using a nested loop to append the items to the lists:

f = 'path_to_some_csv_file.csv'
dr = csv.DictReader(open(f))
dict_of_lists = dr.next()
for k in dict_of_lists.keys():
    dict_of_lists[k] = [dict_of_lists[k]]
for line in dr:
    for k in dict_of_lists.keys():
        dict_of_lists[k].append(line[k])

The first loop sets all values in the dict to the empty list. The next one loops over every line read in from the CSV file, from which DictReader creates a dict of key-values. The inner loop appends the value to list matching the corresponding key, so I wind up with the desired list of dicts. I end up having to write this fairly often.

My question is, is there a more Pythonic way of doing this using built-in functions without the nested loop, or a better idiom, or an alternative way to store this data structure such that I can return an indexable list by querying with a key? If so is there also a way to format the data being ingested by column upfront?

Asked By: mlh3789

||

Answers:

If you’re willing to use a third-party library, then the merge_with function from Toolz makes this whole operation a one-liner:

dict_of_lists = merge_with(list, *csv.DictReader(open(f)))

Using only the stdlib, a defaultdict makes the code less repetitive:

from collections import defaultdict
import csv

f = 'test.csv'

dict_of_lists = defaultdict(list)
for record in DictReader(open(f)):
    for key, val in record.items():    # or iteritems in Python 2
        dict_of_lists[key].append(val)

If you need to do this often, factor it out into a function, e.g. transpose_csv.

Answered By: Fred Foo

Depending on what type of data you’re storing and if you’re ok with using numpy, a good way to do this can be with numpy.genfromtxt:

import numpy as np
data = np.genfromtxt('data.csv', delimiter=',', names=True)

What this will do is create a numpy Structured Array, which provides a nice interface for querying the data by header name (make sure to use names=True if you have a header row).

Example, given data.csv containing:

a,b,c
1,2,3
4,5,6
7,8,9

You can then access elements with:

>>> data['a']        # Column with header 'a'
array([ 1.,  4.,  7.])
>>> data[0]          # First row
(1.0, 2.0, 3.0)
>>> data['c'][2]     # Specific element
9.0
>>> data[['a', 'c']] # Two columns
array([(1.0, 3.0), (4.0, 6.0), (7.0, 9.0)],
      dtype=[('a', '<f8'), ('c', '<f8')])

genfromtext also provides a way, as you requested, to "format the data being ingested by column up front."

converters : variable, optional

The set of functions that convert the data of a column to a value. The converters can also be used to provide a default value for missing data: converters = {3: lambda s: float(s or 0)}.

Answered By: ford

You can use dict and set comprehensions to make your intent more obvious:

dr=csv.DictReader(f)
data={k:[v] for k, v in dr.next().items()}             # create the initial dict of lists
for line_dict in dr:
    {data[k].append(v) for k, v in line_dict.items()}  # append to each

You can use Alex Martelli’s method to flatten a list of lists in Python to flatten an iterator of iterators, which further reduces the first form to:

dr=csv.DictReader(f)
data={k:[v] for k, v in dr.next().items()}
{data[k].append(v) for line_dict in dr for k, v in line_dict.items()}

On Python 2.X, consider using {}.iteritems vs {}.items() if your csv file is sizable.


Further example:

Assume this csv file:

Header 1,Header 2,Header 3
1,2,3
4,5,6
7,8,9

Now suppose you want a dict of lists of each value converted to a float or int. You can do:

def convert(s, converter):
    try:
        return converter(s)
    except Exception:
        return s    

dr=csv.DictReader(f)
data={k:[convert(v, float)] for k, v in dr.next().items()}
{data[k].append(convert(v, float)) for line_dict in dr for k, v in line_dict.items()}

print data
# {'Header 3': [3.0, 6.0, 9.0], 'Header 2': [2.0, 5.0, 8.0], 'Header 1': [1.0, 4.0, 7.0]}
Answered By: dawg

Nothing wrong with ford’s answer, I’ll just add mine here (which makes use of the csv library)

with open(f,'r',encoding='latin1') as csvf:
    dialect = csv.Sniffer().sniff(csvf.readline()) # finds the delimiters automatically
    csvf.seek(0)
    # read file with dialect
    rdlistcsv = csv.reader(csvf,dialect)
    # save to list of rows
    rowslist  = [list(filter(None,line)) for line in rdlistcsv]
    header = rowslist[0]
    data = {}
    for i,key in enumerate(header):
        ilist = [row[i] for row in rowslist]
        data.update({key: ilist})

EDIT: actually, if you do not mind using pandas things get way easier with it:

  1. import pandas

    import pandas as pd
    
  2. import file and save it as pandas dataframe

    df = pd.read_csv(inputfile)
    
  3. turn df into a dictionary

    mydict = df.to_dict(orient='list')
    

This way you use the csv header to define the keys and for each key you have a list of elements (something like an excel column turned to a list)

Answered By: brodegon
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.