Number of lines in csv.DictReader

Question:

I have a csv DictReader object (using Python 3.1), but I would like to know the number of lines/rows contained in the reader before I iterate through it. Something like the following:

myreader = csv.DictReader(open('myFile.csv', newline=''))

totalrows = ?

rowcount = 0
for row in myreader:
    rowcount +=1
    print("Row %d/%d" % (rowcount,totalrows))

I know I could get the total by iterating through the reader, but then I couldn’t run the ‘for’ loop. I could iterate through a copy of the reader, but I cannot find how to copy an iterator.

I could also use

totalrows = len(open('myFile.csv').readlines())

but that seems an unnecessary re-opening of the file. I would rather get the count from the DictReader if possible.

Any help would be appreciated.

Alan

Asked By: Alan Harris-Reid


Answers:

rows = list(myreader)
totalrows = len(rows)
for i, row in enumerate(rows):
    print("Row %d/%d" % (i+1, totalrows))
Answered By: jfs

"I cannot find how to copy an iterator."

Closest is itertools.tee, but simply making a list of it, as @J.F.Sebastian suggests, is best here, as itertools.tee’s docs explain:

"This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee()."
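
For comparison, here is a minimal sketch of what the tee approach would look like. The in-memory CSV data and the names counting_copy and working_copy are assumptions made for illustration, not part of the question. Because the counting copy is exhausted before the working copy starts, tee has to buffer every row, which is the situation the docs warn about.

```python
import csv
import io
import itertools

# Hypothetical in-memory CSV standing in for myFile.csv.
data = io.StringIO("name,qty\napple,1\npear,2\nplum,3\n", newline="")

reader = csv.DictReader(data)
counting_copy, working_copy = itertools.tee(reader)

# Exhausting one copy first forces tee to buffer every row for the other,
# so memory use is comparable to list(reader).
totalrows = sum(1 for _ in counting_copy)

lines = []
for i, row in enumerate(working_copy, start=1):
    lines.append("Row %d/%d: %s" % (i, totalrows, row["name"]))

print("\n".join(lines))
```

Since nothing is saved over list(reader) here, the list version stays the simpler choice.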

Answered By: Alex Martelli

You only need to open the file once:

import csv

f = open('myFile.csv', newline='')  # text mode: the Python 3 csv module needs str, not bytes

countrdr = csv.DictReader(f)
totalrows = 0
for row in countrdr:
  totalrows += 1

f.seek(0)  # Rewind: the first DictReader consumed the file and does not reset it

myreader = csv.DictReader(f)
for row in myreader:
  print(row)  # replace with the real work for each row

No matter what you do, you have to make two passes (well, if your records are a fixed length, which is unlikely, you could just get the file size and divide, but let’s presume that isn’t the case). Opening the file again really doesn’t cost you much, but you can avoid it as illustrated here. Converting to a list just to use len() is potentially going to waste tons of memory, and not be any faster.

Note: The ‘Pythonic’ way is to use enumerate instead of +=, but unpacking each (index, row) tuple (the UNPACK_SEQUENCE opcode) is expensive enough that enumerate can be slower than incrementing a local. That being said, it’s likely an unnecessary micro-optimization that you should probably avoid.

More Notes: If you really just want to generate some kind of progress indicator, it doesn’t necessarily have to be record based. You can tell() on the file object in the loop and just report what % of the data you’re through. It’ll be a little uneven, but chances are on any file that’s large enough to warrant a progress bar the deviation on record length will be lost in the noise.
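
A minimal sketch of that byte-based progress idea follows. The function name rows_with_progress and the demo file are hypothetical. One assumption worth stating: in Python 3, calling tell() on a text-mode file while the csv reader is iterating it raises OSError, so this sketch asks the underlying binary buffer (f.buffer) for its position instead. That position advances in read-ahead chunks, so the percentage is coarse and, for small files, jumps straight to 100%.

```python
import csv
import os
import tempfile


def rows_with_progress(path):
    """Yield (row, fraction_read) pairs for the CSV file at `path`."""
    total_bytes = os.path.getsize(path)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Position of the binary buffer underneath the text wrapper;
            # it moves in chunks, so the fraction is only approximate.
            fraction = f.buffer.tell() / total_bytes if total_bytes else 1.0
            yield row, fraction


# Demo with a small throwaway file (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="") as out:
        out.write("name,qty\napple,1\npear,2\n")
    results = [(row["name"], frac) for row, frac in rows_with_progress(demo_path)]

for name, frac in results:
    print("%s: %.0f%% read" % (name, frac * 100))
```

This needs only one pass over the file, at the cost of reporting bytes consumed rather than an exact row count.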

Answered By: Nick Bastin

As mentioned in the answer https://stackoverflow.com/a/2890569/8056572 you can get the number of lines by taking the length of the reader converted to a list. However, this increases RAM consumption and you lose the benefit of the reader, which yields rows lazily (like a generator).

The best solution in my opinion is to open the file 2 times:

  1. count the number of lines:
total_rows = sum(1 for _ in open('myFile.csv')) # -1 if you want to remove the header from the count

Note: I am not using .readlines(), to avoid loading all the lines into memory

  2. iterate over the lines

According to your snippet you will have something like this:

import csv

totalrows = sum(1 for _ in open('myFile.csv')) - 1  # subtract the header line, which DictReader does not yield as a row

myreader = csv.DictReader(open('myFile.csv'))

for i, _ in enumerate(myreader, start=1):
    print("Row %d/%d" % (i, totalrows))

Note: start=1 in enumerate sets the first value of i. The default is 0; if you keep the default, you have to use i + 1 in the print call.


If you really do not want to open the file two times you can use seek as mentioned in the answer https://stackoverflow.com/a/2891061/8056572

import csv

f = open('myFile.csv')

totalrows = sum(1 for _ in f) - 1  # subtract the header line; name must match the print below

f.seek(0)

myreader = csv.DictReader(f)

for i, _ in enumerate(myreader, start=1):
    print("Row %d/%d" % (i, totalrows))
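
One caveat with counting physical lines: a quoted field may contain a newline, in which case the line count is larger than the number of rows DictReader yields. The sketch below uses made-up in-memory data to show the difference, and counts with a DictReader before rewinding (as in the seek-based answer) to get a matching total.

```python
import csv
import io

# Hypothetical data: the quoted "note" field spans two physical lines.
text = 'name,note\napple,"line one\nline two"\npear,ok\n'

# Counting physical lines (minus the header) overcounts here.
physical_lines = sum(1 for _ in io.StringIO(text, newline="")) - 1

# Counting parsed records gives the number DictReader will really yield.
f = io.StringIO(text, newline="")
record_count = sum(1 for _ in csv.DictReader(f))

f.seek(0)  # rewind before the real pass
rows = list(csv.DictReader(f))

print(physical_lines, record_count, len(rows))
```

If the file is known to have no embedded newlines, the plain line count is fine and cheaper, since it skips CSV parsing.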