file.tell() inconsistency

Question:

Does anybody happen to know why, when you iterate over a file this way:

Input:

f = open('test.txt', 'r')
for line in f:
    print "f.tell(): ",f.tell()

Output:

f.tell(): 8192
f.tell(): 8192
f.tell(): 8192
f.tell(): 8192

I consistently get the wrong file position from tell()? However, if I use readline(), tell() returns the correct position:

Input:

f = open('test.txt', 'r')
while True:
    line = f.readline()
    if (line == ''):
        break
    print "f.tell(): ",f.tell()

Output:

f.tell(): 103
f.tell(): 107
f.tell(): 115
f.tell(): 124

I’m running Python 2.7.1, by the way.

Asked By: nigp4w rudy


Answers:

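Iterating over an open file uses a read-ahead buffer to increase efficiency. As a result, the file pointer advances in large steps through the file as you loop over the lines, and tell() reports the position at the end of the buffered chunk rather than at the end of the current line.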

From the File Objects documentation:

In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer. As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right. However, using seek() to reposition the file to an absolute position will flush the read-ahead buffer.

If you need to rely on .tell(), don’t use the file object as an iterator. You can turn .readline() into an iterator instead (at the price of some performance loss):

for line in iter(f.readline, ''):
    print f.tell()

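This uses the sentinel argument of the iter() function, which turns any callable into an iterator: f.readline is called repeatedly until it returns the sentinel value '' (the empty string that signals end of file).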
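For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same iter(f.readline, sentinel) idea. The helper name line_offsets and the throwaway temp file are my own assumptions for illustration, not part of the question. It opens the file in binary mode so that tell() is a plain byte offset, and the sentinel is therefore b'' (which equals '' on Python 2).

```python
import os
import tempfile


def line_offsets(path):
    """Return (offset_after_line, line) pairs for the file at path.

    Hypothetical helper for illustration only.
    """
    results = []
    # Binary mode keeps tell() a plain byte offset.
    with open(path, 'rb') as f:
        # iter(callable, sentinel) calls f.readline() until it returns b''.
        for line in iter(f.readline, b''):
            results.append((f.tell(), line))
    return results


# Build a small throwaway file to demonstrate.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as tmp:
    tmp.write(b'abc\nde\nfghi\n')

demo_offsets = [pos for pos, _ in line_offsets(demo_path)]
os.remove(demo_path)
print(demo_offsets)  # [4, 7, 12]
```

Each offset is the position just past the line that was read, which matches what the readline() loop in the question prints.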

Answered By: Martijn Pieters

The answer lies in the following part of Python 2.7 source code (fileobject.c):

#define READAHEAD_BUFSIZE 8192

static PyObject *
file_iternext(PyFileObject *f)
{
    PyStringObject* l;

    if (f->f_fp == NULL)
        return err_closed();
    if (!f->readable)
        return err_mode("reading");

    l = readahead_get_line_skip(f, 0, READAHEAD_BUFSIZE);
    if (l == NULL || PyString_GET_SIZE(l) == 0) {
        Py_XDECREF(l);
        return NULL;
    }
    return (PyObject *)l;
}

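As you can see, the file object's iterator interface reads the file in blocks of 8 KB (READAHEAD_BUFSIZE is 8192 bytes). This explains why f.tell() behaves the way it does: it reports the position after the last buffered read, not the end of the line just returned.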

The documentation suggests it’s done for performance reasons (and does not guarantee any particular size of the readahead buffer).
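If you want to keep the fast for loop and still know where each line starts, one alternative (not taken from the answers here, just a sketch under my own assumptions) is to skip tell() inside the loop and add up the line lengths yourself. The generator name iter_with_offsets is hypothetical, and it assumes the file is opened in binary mode so that len(line) equals the number of bytes consumed. io.BytesIO stands in for a real file.

```python
import io


def iter_with_offsets(f):
    """Yield (start_offset, line) for each line of a binary file object.

    Hypothetical helper for illustration only.
    """
    offset = f.tell()  # position before iteration starts
    for line in f:
        yield offset, line
        # Count bytes ourselves; this does not depend on the read-ahead buffer.
        offset += len(line)


sample = io.BytesIO(b'abc\nde\nfghi\n')
pairs = list(iter_with_offsets(sample))
print([start for start, _ in pairs])  # [0, 4, 7]
```

Note that in text mode with newline translation, len(line) may not match the number of bytes on disk, so this approach is only a fit for binary mode.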

Answered By: NPE

I experienced the same read-ahead buffer issue and solved it using Martijn’s suggestion.

I’ve since generalized my solution for anyone else looking to do such things:

https://github.com/loisaidasam/csv-position-reader

Happy CSV parsing!
