How should I read a file line-by-line in Python?

Question:

In pre-historic times (Python 1.4) we did:

fp = open('filename.txt')
while 1:
    line = fp.readline()
    if not line:
        break
    print(line)

after Python 2.1, we did:

for line in open('filename.txt').xreadlines():
    print(line)

before we got the convenient iterator protocol in Python 2.3, and could do:

for line in open('filename.txt'):
    print(line)

I’ve seen some examples using the more verbose:

with open('filename.txt') as fp:
    for line in fp:
        print(line)

is this the preferred method going forwards?

[edit] I get that the with statement ensures closing of the file… but why isn’t that included in the iterator protocol for file objects?

Asked By: thebjorn


Answers:

Yes,

with open('filename.txt') as fp:
    for line in fp:
        print(line)

is the way to go.

It is not more verbose. It is safer.

Answered By: eumiro

There is exactly one reason why the following is preferred:

with open('filename.txt') as fp:
    for line in fp:
        print(line)

We are all spoiled by CPython’s relatively deterministic reference-counting scheme for garbage collection. Other implementations of Python (PyPy and Jython, for example, do not rely on reference counting) will not necessarily close the file "quickly enough" without the with block, because they use some other scheme to reclaim memory.

In such an implementation, you might get a "too many files open" error from the OS if your code opens files faster than the garbage collector calls finalizers on orphaned file handles. The usual workaround is to trigger the GC immediately, but this is a nasty hack and it has to be done by every function that could encounter the error, including those in libraries. What a nightmare.

Or you could just use the with block.
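
A minimal sketch of the difference, using a temporary file so it runs anywhere. The helper name count_lines and the file contents are made up for illustration. The with block closes the file at a known point, whether the loop finishes normally or raises, without depending on when the garbage collector runs.

```python
import os
import tempfile

def count_lines(path):
    """Count lines; also return the file object so the caller can inspect fp.closed."""
    with open(path) as fp:
        count = sum(1 for _ in fp)
    # The with block has exited, so fp is already closed here.
    return count, fp

# Build a small throwaway file to read (hypothetical content).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('first\nsecond\nthird\n')
    path = tmp.name

count, fp = count_lines(path)
print(count, fp.closed)

# The file is also closed when the loop body raises.
try:
    with open(path) as fp2:
        for line in fp2:
            raise ValueError('simulated failure while processing a line')
except ValueError:
    pass
print(fp2.closed)

os.remove(path)
```

Under the hood this behaves roughly like a try/finally that calls fp.close() in the finally clause.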

Bonus Question

(Stop reading now if you are only interested in the objective aspects of the question.)

Why isn’t that included in the iterator protocol for file objects?

This is a subjective question about API design, so I have a subjective answer in two parts.

On a gut level, this feels wrong, because it makes the iterator protocol do two separate things (iterating over lines and closing the file handle), and it’s often a bad idea to make a simple-looking function do two actions. In this case, it feels especially bad because iterators relate in a quasi-functional, value-based way to the contents of a file, but managing file handles is a completely separate task. Squashing both, invisibly, into one action is surprising to humans who read the code and makes it more difficult to reason about program behavior.

Other languages have essentially come to the same conclusion. Haskell briefly flirted with so-called "lazy IO" which allows you to iterate over a file and have it automatically closed when you get to the end of the stream, but it’s almost universally discouraged to use lazy IO in Haskell these days, and Haskell users have mostly moved to more explicit resource management like Conduit which behaves more like the with block in Python.

On a technical level, there are some things you may want to do with a file handle in Python which would not work as well if iteration closed the file handle. For example, suppose I need to iterate over the file twice:

with open('filename.txt') as fp:
    for line in fp:
        ...
    fp.seek(0)
    for line in fp:
        ...

While this is a less common use case, consider the fact that I might have just added the three lines of code at the bottom to an existing code base which originally had the top three lines. If iteration closed the file, I wouldn’t be able to do that. So keeping iteration and resource management separate makes it easier to compose chunks of code into a larger, working Python program.

Composability is one of the most important usability features of a language or API.

Answered By: Dietrich Epp

If you’re turned off by the extra line, you can use a wrapper function like so:

def with_iter(iterable):
    with iterable as iter:
        for item in iter:
            yield item

for line in with_iter(open('...')):
    ...

In Python 3.3 and later, the yield from expression makes this even shorter:

def with_iter(iterable):
    with iterable as iter:
        yield from iter
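
One caveat worth knowing, sketched below with a temporary file: the wrapper only closes the file once the generator finishes or is itself closed. If the loop stops early, for example via break, the generator stays suspended inside the with block and the file remains open until the generator is closed explicitly or reclaimed by the garbage collector, which brings back the timing concern raised in the other answer. All names other than with_iter are illustrative.

```python
import os
import tempfile

def with_iter(iterable):
    # Same idea as the wrapper above (Python 3.3+), with a name that
    # does not shadow the built-in iter().
    with iterable as it:
        yield from it

with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('a\nb\nc\n')
    path = tmp.name

# Case 1: the loop runs to the end, so the with block exits and closes the file.
fp = open(path)
lines = [line for line in with_iter(fp)]
closed_after_full_loop = fp.closed

# Case 2: the loop stops early; the generator is still suspended inside the with block.
fp = open(path)
gen = with_iter(fp)
for line in gen:
    break
closed_after_break = fp.closed

# Closing the generator unwinds it, which exits the with block and closes the file.
gen.close()
closed_after_gen_close = fp.closed

os.remove(path)
print(closed_after_full_loop, closed_after_break, closed_after_gen_close)
```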

Answered By: Lie Ryan

Adding on to the answers given:

I was reading a text file in chunks and processing the data line by line (in the code below, the input is combined.txt and the output is split.txt). Because the file was read in fixed-size chunks, a chunk would sometimes end in the middle of a line, and the resulting partial line broke my code, which expected complete lines.

After some reading, I found I could work around this by keeping track of the last piece of each chunk. If the chunk ends with a complete line (terminated by \n), there is nothing to carry over. Otherwise I store the partial last line in a variable and concatenate it with the rest of that line when it arrives in the next chunk. In the code below, a line counts as complete once it has at least 18 fields separated by ~. With this I was able to get past the issue.

Sample code:

# in this function i am reading the file in chunks
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

# file where i am writing my final output
write_file=open('split.txt','w')

# variable i am using to store the last partial line from the chunk
placeholder= ''
file_count=1

try:
    with open('/Users/rahulkumarmandal/Desktop/combined.txt') as f:
        for piece in read_in_chunks(f):
            #print('---->>>',piece,'<<<--')
            line_by_line = piece.split('\n')

            for one_line in line_by_line:
                # a non-empty placeholder means the previous line was partial, so prepend it to the current one
                if placeholder:
                    # print('----->',placeholder)
                    # concatenate the previous partial line with the current one
                    one_line=placeholder+one_line
                    # reset the placeholder so the next partial line can be stored in it
                    placeholder=''

                # further logic specific to my use case
                segregated_data= one_line.split('~')
                #print(len(segregated_data),type(segregated_data), one_line)
                if len(segregated_data) < 18:
                    placeholder=one_line
                    continue
                else:
                    placeholder=''
                #print('--------',segregated_data)
                if segregated_data[2]=='2020' and segregated_data[3]=='2021':
                    #write this
                    data=str("~".join(segregated_data))
                    #print('data',data)
                    #f.write(data)
                    write_file.write(data)
                    write_file.write('\n')
                    print(write_file.tell())
                elif segregated_data[2]=='2021' and segregated_data[3]=='2022':
                    #write this
                    data=str("-".join(segregated_data))
                    write_file.write(data)
                    write_file.write('\n')
                    print(write_file.tell())
except Exception as e:
    print('error is', e)
finally:
    # close the output file so buffered data is flushed to disk
    write_file.close()
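
The carry-over idea described above can be sketched more generally. This is not the code above but a simplified version under stated assumptions: lines end with '\n', the sample data is invented, and the helper name iter_lines_from_chunks is made up for illustration. Everything after the last newline in a chunk is held back and prepended to the next chunk, so only complete lines are yielded.

```python
import io

def read_in_chunks(file_object, chunk_size=1024):
    """Yield successive pieces of at most chunk_size characters."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

def iter_lines_from_chunks(file_object, chunk_size=1024):
    """Yield complete lines (without the trailing newline) from chunked reads."""
    leftover = ''  # partial line carried over from the previous chunk
    for piece in read_in_chunks(file_object, chunk_size):
        parts = (leftover + piece).split('\n')
        # The last element is '' if the chunk ended on a newline, else a partial line.
        leftover = parts.pop()
        yield from parts
    if leftover:
        # Final line of a file that does not end with a newline.
        yield leftover

# A tiny chunk size forces lines to be split across chunks.
sample = io.StringIO('alpha~1\nbeta~2\ngamma~3')
result = list(iter_lines_from_chunks(sample, chunk_size=4))
print(result)
```

For plain line-by-line processing, iterating over the file object directly is usually enough, since it already reads through an internal buffer; a helper like this is only worth having when chunked reads are needed for some other reason.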

Answered By: officialrahulmandal