How to get line count of a large file cheaply in Python?


How do I get a line count of a large file in the most memory- and time-efficient manner?

def file_len(filename):
    with open(filename) as f:
        for i, _ in enumerate(f):
    return i + 1
Asked By: SilentGhost



You could execute a subprocess and run wc -l filename

import subprocess

def file_len(fname):
    p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE, 
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    return int(result.strip().split()[0])
Answered By: Ólafur Waage
def file_len(full_path):
  """ Count number of lines in a file."""
  f = open(full_path)
  nr_of_lines = sum(1 for line in f)
  return nr_of_lines
Answered By: pkit

You can’t get any better than that.

After all, any solution will have to read the entire file, figure out how many n you have, and return that result.

Do you have a better way of doing that without reading the entire file? Not sure… The best solution will always be I/O-bound, best you can do is make sure you don’t use unnecessary memory, but it looks like you have that covered.

Answered By: Yuval Adam

As for me this variant will be the fastest:

#!/usr/bin/env python

def main():
    f = open('filename')                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('n')
        buf = read_f(buf_size)

    print lines

if __name__ == '__main__':

reasons: buffering faster than reading line by line and string.count is also very fast

Answered By: Mykola Kharechko

the result of opening a file is an iterator, which can be converted to a sequence, which has a length:

with open(filename) as f:
   return len(list(f))

this is more concise than your explicit loop, and avoids the enumerate.

Answered By: Andrew Jaffe

What about this

def file_len(fname):
  counts = itertools.count()
  with open(fname) as f: 
    for _ in f:
Answered By: odwl

Why not read the first 100 and the last 100 lines and estimate the average line length, then divide the total file size through that numbers? If you don’t need a exact value this could work.

Answered By: Georg Schölly

I believe that a memory mapped file will be the fastest solution. I tried four functions: the function posted by the OP (opcount); a simple iteration over the lines in the file (simplecount); readline with a memory-mapped filed (mmap) (mapcount); and the buffer read solution offered by Mykola Kharechko (bufcount).

I ran each function five times, and calculated the average run-time for a 1.2 million-line text file.

Windows XP, Python 2.5, 2GB RAM, 2 GHz AMD processor

Here are my results:

mapcount : 0.465599966049
simplecount : 0.756399965286
bufcount : 0.546800041199
opcount : 0.718600034714

Edit: numbers for Python 2.6:

mapcount : 0.471799945831
simplecount : 0.634400033951
bufcount : 0.468800067902
opcount : 0.602999973297

So the buffer read strategy seems to be the fastest for Windows/Python 2.6

Here is the code:

from __future__ import with_statement
import time
import mmap
import random
from collections import defaultdict

def mapcount(filename):
    f = open(filename, "r+")
    buf = mmap.mmap(f.fileno(), 0)
    lines = 0
    readline = buf.readline
    while readline():
        lines += 1
    return lines

def simplecount(filename):
    lines = 0
    for line in open(filename):
        lines += 1
    return lines

def bufcount(filename):
    f = open(filename)                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('n')
        buf = read_f(buf_size)

    return lines

def opcount(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
    return i + 1

counts = defaultdict(list)

for i in range(5):
    for func in [mapcount, simplecount, bufcount, opcount]:
        start_time = time.time()
        assert func("big_file.txt") == 1209138
        counts[func].append(time.time() - start_time)

for key, vals in counts.items():
    print key.__name__, ":", sum(vals) / float(len(vals))
Answered By: Ryan Ginstrom

One line, probably pretty fast:

num_lines = sum(1 for line in open('myfile.txt'))
Answered By: Kyle

Just to complete the above methods I tried a variant with the fileinput module:

import fileinput as fi   
def filecount(fname):
        for line in fi.input(fname):
        return fi.lineno()

And passed a 60mil lines file to all the above stated methods:

mapcount : 6.1331050396
simplecount : 4.588793993
opcount : 4.42918205261
filecount : 43.2780818939
bufcount : 0.170812129974

It’s a little surprise to me that fileinput is that bad and scales far worse than all the other methods…

Answered By: BandGap

what about this?

import sys
print "counted",len(data),"lines"
Answered By: S.C

Why wouldn’t the following work?

import sys

# input comes from STDIN
file = sys.stdin
data = file.readlines()

# get total number of lines in file
lines = len(data)

print lines

In this case, the len function uses the input lines as a means of determining the length.

Answered By: krakatit

count = max(enumerate(open(filename)))[0]

Answered By: pyanon

How about this?

import fileinput
import sys

for line in fileinput.input([sys.argv[1]]):

print counter
Answered By: leba-lev

Here is a python program to use the multiprocessing library to distribute the line counting across machines/cores. My test improves counting a 20million line file from 26 seconds to 7 seconds using an 8 core windows 64 server. Note: not using memory mapping makes things much slower.

import multiprocessing, sys, time, os, mmap
import logging, logging.handlers

def init_logger(pid):
    console_format = 'P{0} %(levelname)s %(message)s'.format(pid)
    logger = logging.getLogger()  # New logger at root level
    logger.setLevel( logging.INFO )
    logger.handlers.append( logging.StreamHandler() )
    logger.handlers[0].setFormatter( logging.Formatter( console_format, '%d/%m/%y %H:%M:%S' ) )

def getFileLineCount( queues, pid, processes, file1 ):
    init_logger(pid) 'start' )

    physical_file = open(file1, "r")
    #  mmap.mmap(fileno, length[, tagname[, access[, offset]]]

    m1 = mmap.mmap( physical_file.fileno(), 0, access=mmap.ACCESS_READ )

    #work out file size to divide up line counting

    fSize = os.stat(file1).st_size
    chunk = (fSize / processes) + 1

    lines = 0

    #get where I start and stop
    _seedStart = chunk * (pid)
    _seekEnd = chunk * (pid+1)
    seekStart = int(_seedStart)
    seekEnd = int(_seekEnd)

    if seekEnd < int(_seekEnd + 1):
        seekEnd += 1

    if _seedStart < int(seekStart + 1):
        seekStart += 1

    if seekEnd > fSize:
        seekEnd = fSize

    #find where to start
    if pid > 0: seekStart )
        #read next line
        l1 = m1.readline()  # need to use readline with memory mapped files
        seekStart = m1.tell()

    #tell previous rank my seek start to make their seek end

    if pid > 0:
        queues[pid-1].put( seekStart )
    if pid < processes-1:
        seekEnd = queues[pid].get() seekStart )
    l1 = m1.readline()

    while len(l1) > 0:
        lines += 1
        l1 = m1.readline()
        if m1.tell() > seekEnd or len(l1) == 0:
            break 'done' )
    # add up the results
    if pid == 0:
        for p in range(1,processes):
            lines += queues[0].get()
        queues[0].put(lines) # the total lines counted


if __name__ == '__main__':
    init_logger( 'main' )
    if len(sys.argv) > 1:
        file_name = sys.argv[1]
        logging.fatal( 'parameters required: file-name [processes]' )

    t = time.time()
    processes = multiprocessing.cpu_count()
    if len(sys.argv) > 2:
        processes = int(sys.argv[2])
    queues=[] # a queue for each process
    for pid in range(processes):
        queues.append( multiprocessing.Queue() )
    prev_pipe = 0
    for pid in range(processes):
        p = multiprocessing.Process( target = getFileLineCount, args=(queues, pid, processes, file_name,) )

    jobs[0].join() #wait for counting to finish
    lines = queues[0].get() 'finished {} Lines:{}'.format( time.time() - t, lines ) )
Answered By: Martlark

I have modified the buffer case like this:

def CountLines(filename):
    f = open(filename)
        lines = 1
        buf_size = 1024 * 1024
        read_f = # loop optimization
        buf = read_f(buf_size)

        # Empty file
        if not buf:
            return 0

        while buf:
            lines += buf.count('n')
            buf = read_f(buf_size)

        return lines

Now also empty files and the last line (without n) are counted.

Answered By: Dummy

I got a small (4-8%) improvement with this version which re-uses a constant buffer so it should avoid any memory or GC overhead:

lines = 0
buffer = bytearray(2048)
with open(filename) as f:
  while f.readinto(buffer) > 0:
      lines += buffer.count('n')

You can play around with the buffer size and maybe see a little improvement.

Answered By: Scott Persinger


lines = 0
with open(path) as f:
    for line in f:
        lines += 1
Answered By: Colonel Panic

How about this one-liner:

file_length = len(open('myfile.txt','r').read().split('n'))

Takes 0.003 sec using this method to time it on a 3900 line file

def c():
  import time
  s = time.time()
  file_length = len(open('myfile.txt','r').read().split('n'))
  print time.time() - s
Answered By: onetwopunch

I would use Python’s file object method readlines, as follows:

with open(input_file) as foo:
    lines = len(foo.readlines())

This opens the file, creates a list of lines in the file, counts the length of the list, saves that to a variable and closes the file again.

Answered By: Daniel Lee
print open('file.txt', 'r').read().count("n") + 1
Answered By: Andrés Torres
def line_count(path):
    count = 0
    with open(path) as lines:
        for count, l in enumerate(lines, start=1):
    return count
Answered By: mdwhatcott

If one wants to get the line count cheaply in Python in Linux, I recommend this method:

import os
print os.popen("wc -l file_path").readline().split()[0]

file_path can be both abstract file path or relative path. Hope this may help.

Answered By: Lerner Zhang

Kyle’s answer

num_lines = sum(1 for line in open('my_file.txt'))

is probably best, an alternative for this is

num_lines =  len(open('my_file.txt').read().splitlines())

Here is the comparision of performance of both

In [20]: timeit sum(1 for line in open('Charts.ipynb'))
100000 loops, best of 3: 9.79 µs per loop

In [21]: timeit len(open('Charts.ipynb').read().splitlines())
100000 loops, best of 3: 12 µs per loop
Answered By: Chillar Anand

You can use the os.path module in the following way:

import os
import subprocess
Number_lines = int( (subprocess.Popen( 'wc -l {0}'.format( Filename ), shell=True, stdout=subprocess.PIPE).stdout).readlines()[0].split()[0] )

, where Filename is the absolute path of the file.

Answered By: Victor

I had to post this on a similar question until my reputation score jumped a bit (thanks to whoever bumped me!).

All of these solutions ignore one way to make this run considerably faster, namely by using the unbuffered (raw) interface, using bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3, you’ll default into Unicode.)

Using a modified version of the timing tool, I believe the following code is faster (and marginally more pythonic) than any of the solutions offered:

def rawcount(filename):
    f = open(filename, 'rb')
    lines = 0
    buf_size = 1024 * 1024
    read_f =

    buf = read_f(buf_size)
    while buf:
        lines += buf.count(b'n')
        buf = read_f(buf_size)

    return lines

Using a separate generator function, this runs a smidge faster:

def _make_gen(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024*1024)

def rawgencount(filename):
    f = open(filename, 'rb')
    f_gen = _make_gen(
    return sum( buf.count(b'n') for buf in f_gen )

This can be done completely with generators expressions in-line using itertools, but it gets pretty weird looking:

from itertools import (takewhile,repeat)

def rawincount(filename):
    f = open(filename, 'rb')
    bufgen = takewhile(lambda x: x, (*1024) for _ in repeat(None)))
    return sum( buf.count(b'n') for buf in bufgen )

Here are my timings:

function      average, s  min, s   ratio
rawincount        0.0043  0.0041   1.00
rawgencount       0.0044  0.0042   1.01
rawcount          0.0048  0.0045   1.09
bufcount          0.008   0.0068   1.64
wccount           0.01    0.0097   2.35
itercount         0.014   0.014    3.41
opcount           0.02    0.02     4.83
kylecount         0.021   0.021    5.05
simplecount       0.022   0.022    5.25
mapcount          0.037   0.031    7.46
Answered By: Michael Bacon

Another possibility:

import subprocess

def num_lines_in_file(fpath):
    return int(subprocess.check_output('wc -l %s' % fpath, shell=True).strip().split()[0])
Answered By: J.J.

This code is shorter and clearer. It’s probably the best way:

num_lines = open('yourfile.ext').read().count('n')
Answered By: Texom512

This is the fastest thing I have found using pure python.
You can use whatever amount of memory you want by setting buffer, though 2**16 appears to be a sweet spot on my computer.

from functools import partial

with open(myfile) as f:
        print sum(x.count('n') for x in iter(partial(,buffer), ''))

I found the answer here Why is reading lines from stdin much slower in C++ than Python? and tweaked it just a tiny bit. Its a very good read to understand how to count lines quickly, though wc -l is still about 75% faster than anything else.

Answered By: jeffpkamp

One line solution:

import os
os.system("wc -l  filename")  

My snippet:

>>> os.system('wc -l *.txt')

0 bar.txt
1000 command.txt
3 test_file.txt
1003 total
Answered By: TheExorcist

A one-line bash solution similar to this answer, using the modern subprocess.check_output function:

def line_count(filename):
    return int(subprocess.check_output(['wc', '-l', filename]).split()[0])
Answered By: 1''

Here is what I use, seems pretty clean:

import subprocess

def count_file_lines(file_path):
    Counts the number of lines in a file using wc utility.
    :param file_path: path to file
    :return: int, no of lines
    num = subprocess.check_output(['wc', '-l', file_path])
    num = num.split(' ')
    return int(num[0])

UPDATE: This is marginally faster than using pure python but at the cost of memory usage. Subprocess will fork a new process with the same memory footprint as the parent process while it executes your command.

Answered By: radtek
def count_text_file_lines(path):
    with open(path, 'rt') as file:
        line_count = sum(1 for _line in file)
    return line_count
Answered By: jciloa

If the file can fit into memory, then

with open(fname) as f:
    count = len('n')) - 1
Answered By: Karthik

Create an executable script file named


import sys
count = 0
for line in sys.stdin:

And then pipe the file’s content into the python script: cat huge.txt | ./ Pipe works also on Powershell, so you will end up counting number of lines.

For me, on Linux it was 30% faster than the naive solution:

with open('huge.txt') as f:
Answered By: 0x90

If all the lines in your file are the same length (and contain only ASCII characters)*, you can do the following very cheaply:

fileSize     = os.path.getsize( pathToFile )  # file size in bytes
bytesPerLine = someInteger                    # don't forget to account for the newline character
numLines     = fileSize // bytesPerLine

*I suspect more effort would be required to determine the number of bytes in a line if unicode characters like é are used.

Answered By: Jet Blue

Simple method:


>>> f = len(open("myfile.txt").readlines())
>>> f

>>> f = open("myfile.txt").read().count('n')
>>> f
num_lines = len(list(open('myfile.txt')))

An alternative for big files is using xreadlines():

count = 0
for line in open(thefilepath).xreadlines(  ): count += 1

For Python 3 please see: What substitutes xreadlines() in Python 3?

Answered By: alexisdevarennes

Using Numba

We can use Numba to JIT (Just in time) compile our function to machine code. def numbacountparallel(fname) runs 2.8x faster
than def file_len(fname) from the question.


The OS had already cached the file to memory before the benchmarks were run as I don’t see much disk activity on my PC.
The time would be much slower when reading the file for the first time making the time advantage of using Numba insignificant.

The JIT compilation takes extra time the first time the function is called.

This would be useful if we were doing more than just counting lines.

Cython is another option.


As counting lines will be IO bound, use the def file_len(fname) from the question unless you want to do more than just count lines.

import timeit

from numba import jit, prange
import numpy as np

from itertools import (takewhile,repeat)

FILE = '../data/us_confirmed.csv' # 40.6MB, 371755 line file
CR = ord('n')

# Copied from the question above. Used as a benchmark
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
    return i + 1

# Copied from another answer. Used as a benchmark
def rawincount(filename):
    f = open(filename, 'rb')
    bufgen = takewhile(lambda x: x, (*1024*10) for _ in repeat(None)))
    return sum( buf.count(b'n') for buf in bufgen )

# Single thread
def numbacountsingle_chunk(bs):

    c = 0
    for i in range(len(bs)):
        if bs[i] == CR:
            c += 1

    return c

def numbacountsingle(filename):
    f = open(filename, "rb")
    total = 0
    while True:
        chunk =*1024*10)
        lines = numbacountsingle_chunk(chunk)
        total += lines
        if not chunk:

    return total

# Multi thread
@jit(nopython=True, parallel=True)
def numbacountparallel_chunk(bs):

    c = 0
    for i in prange(len(bs)):
        if bs[i] == CR:
            c += 1

    return c

def numbacountparallel(filename):
    f = open(filename, "rb")
    total = 0
    while True:
        chunk =*1024*10)
        lines = numbacountparallel_chunk(np.frombuffer(chunk, dtype=np.uint8))
        total += lines
        if not chunk:

    return total

print(numbacountparallel(FILE)) # This allows Numba to compile and cache the function without adding to the time.
print(timeit.Timer(lambda: numbacountparallel(FILE)).timeit(number=100))

print(timeit.Timer(lambda: numbacountsingle(FILE)).timeit(number=100))

print(timeit.Timer(lambda: rawincount(FILE)).timeit(number=100))

print(timeit.Timer(lambda: rawincount(FILE)).timeit(number=100))

Time in seconds for 100 calls to each function




Answered By: Paul Menzies

This is a meta-comment on some of the other answers.

  • The line-reading and buffered n-counting techniques won’t return the same answer for every file, because some text files have no newline at the end of the last line. You can work around this by checking the last byte of the last nonempty buffer and adding 1 if it’s not b'n'.

  • In Python 3, opening the file in text mode and in binary mode can yield different results, because text mode by default recognizes CR, LF, and CRLF as line endings (converting them all to 'n'), while in binary mode only LF and CRLF will be counted if you count b'n'. This applies whether you read by lines or into a fixed-size buffer. The classic Mac OS used CR as a line ending; I don’t know how common those files are these days.

  • The buffer-reading approach uses a bounded amount of RAM independent of file size, while the line-reading approach could read the entire file into RAM at once in the worst case (especially if the file uses CR line endings). In the worst case it may use substantially more RAM than the file size, because of overhead from dynamic resizing of the line buffer and (if you opened in text mode) Unicode decoding and storage.

  • You can improve the memory usage, and probably the speed, of the buffered approach by pre-allocating a bytearray and using readinto instead of read. One of the existing answers (with few votes) does this, but it’s buggy (it double-counts some bytes).

  • The top buffer-reading answer uses a large buffer (1 MiB). Using a smaller buffer can actually be faster because of OS readahead. If you read 32K or 64K at a time, the OS will probably start reading the next 32K/64K into the cache before you ask for it, and each trip to the kernel will return almost immediately. If you read 1 MiB at a time, the OS is unlikely to speculatively read a whole megabyte. It may preread a smaller amount but you will still spend a significant amount of time sitting in the kernel waiting for the disk to return the rest of the data.

Answered By: benrg

A lot of answers already, but unfortunately most of them are just tiny economies on a barely optimizable problem…

I worked on several projects where line count was the core function of the software, and working as fast as possible with a huge number of files was of paramount importance.

The main bottleneck with line count is I/O access, as you need to read each line in order to detect the line return character, there is simply no way around. The second potential bottleneck is memory management: the more you load at once, the faster you can process, but this bottleneck is negligible compared to the first.

Hence, there are 3 major ways to reduce the processing time of a line count function, apart from tiny optimizations such as disabling gc collection and other micro-managing tricks:

  1. Hardware solution: the major and most obvious way is non-programmatic: buy a very fast SSD/flash hard drive. By far, this is how you can get the biggest speed boosts.

  2. Data preparation solution: if you generate or can modify how the files you process are generated, or if it’s acceptable that you can pre-process them, first convert the line return to unix style (n) as this will save 1 character compared to Windows or MacOS styles (not a big save but it’s an easy gain), and secondly and most importantly, you can potentially write lines of fixed length. If you need variable length, you can always pad smaller lines. This way, you can calculate instantly the number of lines from the total filesize, which is much faster to access. Often, the best solution to a problem is to pre-process it so that it better fits your end purpose.

  3. Parallelization + hardware solution: if you can buy multiple hard disks (and if possible SSD flash disks), then you can even go beyond the speed of one disk by leveraging parallelization, by storing your files in a balanced way (easiest is to balance by total size) among disks, and then read in parallel from all those disks. Then, you can expect to get a multiplier boost in proportion with the number of disks you have. If buying multiple disks is not an option for you, then parallelization likely won’t help (except if your disk has multiple reading headers like some professional-grade disks, but even then the disk’s internal cache memory and PCB circuitry will likely be a bottleneck and prevent you from fully using all heads in parallel, plus you have to devise a specific code for this hard drive you’ll use because you need to know the exact cluster mapping so that you store your files on clusters under different heads, and so that you can read them with different heads after). Indeed, it’s commonly known that sequential reading is almost always faster than random reading, and parallelization on a single disk will have a performance more similar to random reading than sequential reading (you can test your hard drive speed in both aspects using CrystalDiskMark for example).

If none of those are an option, then you can only rely on micro-managing tricks to improve by a few percents the speed of your line counting function, but don’t expect anything really significant. Rather, you can expect the time you’ll spend tweaking will be disproportionated compared to the returns in speed improvement you’ll see.

Answered By: gaborous

After a perfplot analysis, one has to recommend the buffered read solution

def buf_count_newlines_gen(fname):
    def _make_gen(reader):
        while True:
            b = reader(2 ** 16)
            if not b: break
            yield b

    with open(fname, "rb") as f:
        count = sum(buf.count(b"n") for buf in _make_gen(
    return count

It’s fast and memory-efficient. Most other solutions are about 20 times slower.

enter image description here

Code to reproduce the plot:

import mmap
import subprocess
from functools import partial

import perfplot

def setup(n):
    fname = "t.txt"
    with open(fname, "w") as f:
        for i in range(n):
            f.write(str(i) + "n")
    return fname

def for_enumerate(fname):
    i = 0
    with open(fname) as f:
        for i, _ in enumerate(f):
    return i + 1

def sum1(fname):
    return sum(1 for _ in open(fname))

def mmap_count(fname):
    with open(fname, "r+") as f:
        buf = mmap.mmap(f.fileno(), 0)

    lines = 0
    while buf.readline():
        lines += 1
    return lines

def for_open(fname):
    lines = 0
    for _ in open(fname):
        lines += 1
    return lines

def buf_count_newlines(fname):
    lines = 0
    buf_size = 2 ** 16
    with open(fname) as f:
        buf =
        while buf:
            lines += buf.count("n")
            buf =
    return lines

def buf_count_newlines_gen(fname):
    def _make_gen(reader):
        b = reader(2 ** 16)
        while b:
            yield b
            b = reader(2 ** 16)

    with open(fname, "rb") as f:
        count = sum(buf.count(b"n") for buf in _make_gen(
    return count

def wc_l(fname):
    return int(subprocess.check_output(["wc", "-l", fname]).split()[0])

def sum_partial(fname):
    with open(fname) as f:
        count = sum(x.count("n") for x in iter(partial(, 2 ** 16), ""))
    return count

def read_count(fname):
    return open(fname).read().count("n")

b = perfplot.bench(
    n_range=[2 ** k for k in range(27)],
    xlabel="num lines",
Answered By: Nico Schlömer

Simplest and shortest way I would use is:

f = open("my_file.txt", "r")
Answered By: DesiKeki

i found that you can just.

f = open("data.txt")
linecout = len(f.readlines())

will give you an answer

Answered By: Captain Peter
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.