What's the most efficient way to skip the first line of a file while reading the whole file into a list of lines?

Question:

I want to read a file in one shot in my Python script, and I want to skip the first line while doing so.

I can think of a few ways:

1.

    with open('file name', 'r') as fh:
        fh.readline()
        lines = fh.readlines()

2.

    with open('file name', 'r') as fh:
        lines = fh.readlines()
        del lines[0]

3.

    with open('file name', 'r') as fh:
        lines = fh.readlines()[1:]

Please let me know what you think. It’ll be great if you can provide any references.

Please note that I’m not looking to find ways to skip first line. As can be seen, I already have 3 ways to do so. What I’m looking for is what’s the most efficient way and why. It’s possible that I’ve not listed the most efficient way.

I believe

#1 may be most efficient as the offset would have been moved past first line by readline and then we just read rest of the lines.

#2: not really sure whether it involves moving all elements by one or just the pointer is moved.

#3: It’ll involve creating another list, which may be least efficient.

Asked By: SRK


Answers:

With a limited test in ipython using %%timeit and sample data only 1,000 lines long, #1 does indeed seem to be the fastest, but the difference is negligible. With 1,000,000 blank lines, a bigger difference can be seen, and an approach that you did not list pulls ahead.

To determine the relative performance between different blocks of code, you need to profile the code in question. One of the easiest ways to profile a given function or short snippet of code is using the %timeit "magic" command in ipython.
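
Outside of ipython, roughly the same comparison can be made with the standard-library timeit module. The following is only a minimal sketch: the function names, the temporary sample file, and the loop counts are made up for illustration and are not the setup used for the measurements in this answer.

```python
import os
import tempfile
import timeit


def skip_with_readline(path):
    """Approach #1: consume the first line, then read the rest."""
    with open(path, 'r', encoding='utf-8') as f:
        f.readline()
        return f.readlines()


def skip_with_del(path):
    """Approach #2: read everything, then delete the first element in place."""
    with open(path, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    del lines[0]
    return lines


def skip_with_slice(path):
    """Approach #3: read everything, then copy all but the first element."""
    with open(path, 'r', encoding='utf-8') as f:
        return f.readlines()[1:]


def time_approaches(path, number=200, repeat=5):
    """Return the best observed per-call time in seconds for each approach."""
    results = {}
    for func in (skip_with_readline, skip_with_del, skip_with_slice):
        timings = timeit.repeat(lambda: func(path), number=number, repeat=repeat)
        results[func.__name__] = min(timings) / number
    return results


def make_sample_file(line_count=1000):
    """Write a hypothetical sample file and return its path."""
    fd, path = tempfile.mkstemp(suffix='.txt')
    with os.fdopen(fd, 'w', encoding='utf-8') as f:
        f.write('\n'.join('line %d' % i for i in range(line_count)))
    return path


if __name__ == '__main__':
    sample = make_sample_file()
    try:
        for name, seconds in time_approaches(sample).items():
            print(f'{name}: {seconds * 1e6:.1f} µs per call')
    finally:
        os.remove(sample)
```

Taking the minimum of several repeats tends to reduce noise from other processes, but the numbers will still depend on the machine, the file system cache, and the Python version.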

For this test, initially I used the following sample data:

chars = [chr(c) for c in range(97, 123)]
line = ','.join(c * 5 for c in chars)
# 'aaaaa,bbbbb,ccccc,ddddd,eeeee,fffff,ggggg,hhhhh,iiiii,jjjjj,kkkkk,lllll,mmmmm,nnnnn,ooooo,ppppp,qqqqq,rrrrr,sssss,ttttt,uuuuu,vvvvv,wwwww,xxxxx,yyyyy,zzzzz'
with open('test.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(line for _ in range(1000)))

The approach that was the fastest:

>>> %%timeit
... with open('test.txt', 'r', encoding='utf-8') as f:
...     next(f)  # roughly equivalent to f.readline()
...     data = f.readlines()
...
166 µs ± 185 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

The other two examples you had were slightly slower:

>>> %%timeit
... with open('test.txt', 'r', encoding='utf-8') as f:
...     data = f.readlines()[1:]
...
177 µs ± 5.06 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

>>> %%timeit
... with open('test.txt', 'r', encoding='utf-8') as f:
...     data = f.readlines()
...     del data[0]
...
168 µs ± 893 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Using 1,000,000 blank lines as follows, we can see a bigger difference between approaches:

with open('test_1.txt', 'w', encoding='utf-8') as f:
    f.write('\n' * 1_000_000)

The initial approaches:

>>> %%timeit
... with open('test_1.txt', 'r', encoding='utf-8') as f:
...     next(f)
...     data = f.readlines()
...
20.4 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %%timeit
... with open('test_1.txt', 'r', encoding='utf-8') as f:
...     f.readline()
...     data = f.readlines()
...
20.6 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %%timeit
... with open('test_1.txt', 'r', encoding='utf-8') as f:
...     data = f.readlines()[1:]
...
22.2 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %%timeit
... with open('test_1.txt', 'r', encoding='utf-8') as f:
...     data = f.readlines()
...     del data[0]
...
20.7 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The slice approach takes the longest since it needs to do more work to construct a new list.
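
The difference between the slice and del approaches can be illustrated with a small sketch. The helper names here are hypothetical, and the comments describe CPython's list behavior: slicing allocates a second list, while del shifts the remaining references within the existing one.

```python
def drop_first_by_slice(lines):
    """Return a new list without the first element; the input is unchanged."""
    return lines[1:]


def drop_first_in_place(lines):
    """Remove the first element from the same list object and return it."""
    del lines[0]
    return lines


original = ['header\n', 'row 1\n', 'row 2\n']

sliced = drop_first_by_slice(original)
print(sliced is original)   # a separate list object was created
print(len(original))        # the original still holds the header line

same = drop_first_in_place(original)
print(same is original)     # the same list object, now one element shorter
```

Both operations take time proportional to the length of the list, which fits the small gap in the timings above, but the slice briefly keeps two full lists of references alive.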

Alternate approaches that pulled ahead included reading the file in its entirety in one .read() call, then splitting it:

>>> %%timeit
... with open('test_1.txt', 'r', encoding='utf-8') as f:
...     data = f.read().splitlines()
...     del data[0]
...
15.8 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %%timeit
... with open('test_1.txt', 'r', encoding='utf-8') as f:
...     data = f.read().split('\n', 1)[1].splitlines()
...
15.2 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %%timeit
... with open('test_1.txt', 'r', encoding='utf-8') as f:
...     next(f)
...     data = f.read().splitlines()
...
15.2 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The absolute fastest approach I have found so far involved reading the file as binary data, then decoding after reading:

>>> %%timeit
... with open('test_1.txt', 'rb') as f:
...     next(f)
...     data = f.read().decode('utf-8').splitlines()
...
14.2 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In the end, it depends on how much data you need to read, and how much memory you have available. For files with fewer lines, the difference between approaches is negligible.

Avoiding slicing in this scenario is always preferable. Reading more data in fewer system calls generally produces faster results because more of the post-processing operations can be performed in memory instead of on a file handle. If you don’t have enough memory though, then this may not be possible.
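
If the file is too large to hold in memory as a list, one possible alternative is to process it lazily instead of reading it in one shot. This is a minimal sketch that departs from the "list of lines" requirement in the question; the function name is hypothetical.

```python
def iter_lines_after_first(path, encoding='utf-8'):
    """Yield the lines of a file one at a time, skipping the first line."""
    with open(path, 'r', encoding=encoding) as f:
        next(f, None)   # discard the first line; the default avoids StopIteration on an empty file
        yield from f    # the file object yields the remaining lines lazily
```

Wrapping the call in list() gives the same result as the readlines() approaches, so this only helps when each line can be handled and then discarded.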

Note that for any of these approaches, run times can vary between trials. In my original test with 1k lines, the approach that was tied for fastest on the first run was slower on a later run:

>>> %%timeit
... with open('test.txt', 'r', encoding='utf-8') as f:
...     next(f)
...     data = f.readlines()
...
172 µs ± 4.09 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

It’s also important to note that premature optimization is the root of all evil – if this is not a major bottleneck in your program (as revealed by profiling your code), then it’s not worth spending a lot of time on it.
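
To check whether file reading is a bottleneck at all, the standard-library cProfile module can profile a whole function call. The sketch below uses a hypothetical wrapper name, and the workload is only a stand-in for real program code.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats('cumulative').print_stats(10)  # show the 10 most expensive entries
    return result, stream.getvalue()


def example_workload():
    """Stand-in for the part of a program being investigated."""
    return sorted(range(1000), reverse=True)


if __name__ == '__main__':
    _, report = profile_call(example_workload)
    print(report)
```

If the function that reads the file does not show up near the top of the report, choosing between the approaches above is unlikely to matter much.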

I would recommend reviewing some more resources about how to profile your code.

Answered By: dskrypa