What is the pythonic way to read CSV file data as rows of namedtuples?
Question:
What is the best way to take a data file that contains a header row and read this row into a named tuple so that the data rows can be accessed by header name?
I was attempting something like this:
import csv
from collections import namedtuple
with open('data_file.txt', mode="r") as infile:
    reader = csv.reader(infile)
    Data = namedtuple("Data", ", ".join(i for i in reader[0]))
    next(reader)
    for row in reader:
        data = Data(*row)
The reader object is not subscriptable, so the above code throws a TypeError. What is the pythonic way to read a file header into a namedtuple?
Answers:
Use:
Data = namedtuple("Data", next(reader))
and omit the line:
next(reader)
Combining this with an iterative version based on martineau’s comment below, the example becomes, for Python 2:
import csv
from collections import namedtuple
from itertools import imap
with open("data_file.txt", mode="rb") as infile:
    reader = csv.reader(infile)
    Data = namedtuple("Data", next(reader))  # get names from column headers
    for data in imap(Data._make, reader):
        print data.foo
        # ...further processing of a line...
and for Python 3:
import csv
from collections import namedtuple
with open("data_file.txt", newline="") as infile:
    reader = csv.reader(infile)
    Data = namedtuple("Data", next(reader))  # get names from column headers
    for data in map(Data._make, reader):
        print(data.foo)
        # ...further processing of a line...
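To see why next(reader) is enough, here is a minimal self-contained sketch that swaps the file for an in-memory io.StringIO. The sample text and the column names foo and bar are made up for illustration. The reader is an iterator, so the first next() call consumes the header row and the loop that follows only sees data rows.

```python
import csv
import io
from collections import namedtuple

# In-memory stand-in for data_file.txt; the contents are invented for this sketch.
sample = io.StringIO("foo,bar\n1,2\n3,4\n", newline="")

reader = csv.reader(sample)
Data = namedtuple("Data", next(reader))  # consumes the header row -> field names

# Only the data rows are left in the iterator at this point.
records = [Data._make(row) for row in reader]

print(records[0].foo, records[1].bar)
```

Note that csv.reader yields every field as a string, so any numeric conversion still has to be done by hand.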
Please have a look at csv.DictReader. Basically, it takes the column names from the first row, as you’re looking for, and after that lets you access each column in a row by name using a dictionary.
If for some reason you still need to access the rows as a collections.namedtuple, it should be easy to transform the dictionaries to named tuples as follows:
import collections
import csv
with open('data_file.txt') as infile:
    reader = csv.DictReader(infile)
    Data = collections.namedtuple('Data', reader.fieldnames)
    tuples = [Data(**row) for row in reader]
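As a rough illustration of both styles of access, the sketch below runs DictReader over an in-memory sample (the data and the column names foo and bar are assumptions, not taken from a real file), reads a value by key, and then converts the same rows to named tuples.

```python
import collections
import csv
import io

# Invented sample data standing in for data_file.txt.
sample = io.StringIO("foo,bar\n1,2\n3,4\n", newline="")

reader = csv.DictReader(sample)
rows = list(reader)              # each row is a dict keyed by header name
first_foo = rows[0]["foo"]       # dictionary-style access

# Same rows as named tuples, for attribute-style access.
Data = collections.namedtuple("Data", reader.fieldnames)
tuples = [Data(**row) for row in rows]

print(first_foo, tuples[1].bar)
```

One caveat with this conversion: if a data row has more fields than the header, DictReader stores the extras under the key None by default, and Data(**row) will then fail, so it assumes well-formed rows.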
I’d suggest this approach:
import csv
from collections import namedtuple
with open("data.csv", 'r') as f:
    reader = csv.reader(f, delimiter=',')
    Row = namedtuple('Row', next(reader))
    rows = [Row(*line) for line in reader]
If you work with Pandas, the solution becomes even more elegant:
import pandas as pd
from collections import namedtuple
data = pd.read_csv("data.csv")
Row = namedtuple('Row', data.columns)
rows = [Row(*row) for index, row in data.iterrows()]
In both cases you can interact with the records by field names:
for row in rows:
    print(row.foo)
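Attribute access like row.foo only works when every header is a valid Python identifier. If the file might contain headers with spaces or reserved words, one option is namedtuple's rename=True argument, which replaces invalid names with positional ones. A small sketch, with a hypothetical header list:

```python
from collections import namedtuple

# Hypothetical header row: "first name" contains a space, "class" is a keyword.
header = ["id", "first name", "class"]

# rename=True swaps each invalid name for an underscore plus its position.
Row = namedtuple("Row", header, rename=True)

row = Row("7", "Ada", "math")
print(Row._fields)
print(row.id, row._1, row._2)
```

The renamed fields lose their original labels, so this is a fallback rather than a substitute for cleaning up the header names.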