(Python) Counting lines in a huge (>10GB) file as fast as possible

Question:

I have a really simple script right now that counts lines in a text file using enumerate():

i = 0
f = open("C:/Users/guest/Desktop/file.log", "r")
for i, line in enumerate(f):
    pass
print i + 1
f.close()

This takes around three and a half minutes to go through a 15GB log file with ~30 million lines. It would be great if I could get this under two minutes, because these are daily logs and we want to do a monthly analysis: the code will have to process 30 logs of ~15GB each – possibly more than an hour and a half in total – and we’d like to minimise the time & memory load on the server.

I would also settle for a good approximation/estimation method, but it needs to be accurate to about 4 significant figures…
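One way to get such an estimate is to sample a stretch of the file, measure its newline density, and scale by the total file size. A sketch (the `estimate_lines` helper name is hypothetical; whether it reaches 4 significant figures depends entirely on how uniform the line lengths are):

```python
import os

def estimate_lines(path, sample_size=16 * 1024 * 1024):
    # Estimate total lines as file_size * (newlines per sampled byte).
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if len(sample) >= size:  # small file: the "sample" is the whole file
        return sample.count(b"\n")
    newlines = sample.count(b"\n")
    return round(size * newlines / len(sample))
```

This reads only the sample, so it runs in roughly constant time regardless of file size.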

Thank you!

Asked By: Adrienne


Answers:

mmap the file, and count up the newlines.

import mmap

def mapcount(filename):
    # Map the whole file read-only and count lines with the mmap's readline().
    with open(filename, "rb") as f:
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        lines = 0
        readline = buf.readline  # local alias avoids an attribute lookup per line
        while readline():
            lines += 1
        return lines

Ignacio’s answer is correct, but might fail if you have a 32 bit process.

But maybe it could be useful to read the file block-wise and then count the \n characters in each block.

def blocks(files, size=65536):
    while True:
        b = files.read(size)
        if not b: break
        yield b

with open("file", "r") as f:
    print sum(bl.count("\n") for bl in blocks(f))

will do your job.

Note that I don’t open the file as binary, so \r\n will be converted to \n, making the counting more reliable.
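Opening in binary works as well, though: every \r\n pair contains exactly one \n, so counting the newline bytes in raw blocks gives the same total without any decoding overhead. A sketch of that variant (the helper name is mine):

```python
def count_lines_binary(path, size=65536):
    # Read fixed-size binary blocks and count b"\n" in each.
    lines = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(size)
            if not block:
                break
            lines += block.count(b"\n")
    return lines
```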

For Python 3, and to make it more robust, for reading files with all kinds of characters:

def blocks(files, size=65536):
    while True:
        b = files.read(size)
        if not b: break
        yield b

with open("file", "r",encoding="utf-8",errors='ignore') as f:
    print (sum(bl.count("n") for bl in blocks(f)))
Answered By: glglgl

I know it’s a bit unfair, but you could do this:

import subprocess

int(subprocess.check_output(["wc", "-l", r"C:\alarm.bat"]).split()[0])

Note the raw string: without it, \a in the path would be read as a bell character.

If you’re on Windows, check out Coreutils.

Answered By: Jakob Bowyer

A fast, 1-line solution is:

sum(1 for i in open(file_path, 'rb'))

It should work on files of arbitrary size.

Answered By: AJSmyth

I’d extend glglgl’s answer and run his/her code using the Python multiprocessing module for a faster count:

def blocks(f, cut, size=64 * 1024):  # 65536
    # Yield blocks from f, stopping at the end of this worker's byte range.
    start, chunk = cut
    read_size = size
    done = False
    while not done:
        if f.tell() + size > start + chunk:
            read_size = start + chunk - f.tell()
            done = True
        b = f.read(read_size)
        if not b:
            break
        yield b


def get_chunk_line_count(data):
    fn, chunk_id, cut = data
    start, chunk = cut
    cnt = 0
    last_bl = None

    with open(fn, "r") as f:
        f.seek(start)
        for bl in blocks(f, cut):
            cnt += bl.count('\n')
            last_bl = bl

        # Don't count a trailing partial line; the next chunk finishes it.
        if last_bl is not None and not last_bl.endswith('\n'):
            cnt -= 1

        return cnt
....
pool = multiprocessing.Pool(processes=pool_size,
                            initializer=start_process,
                            )
pool_outputs = pool.map(get_chunk_line_count, inputs)
pool.close() # no more tasks
pool.join() 

This improves counting performance roughly 20-fold.
I wrapped it into a script and put it on GitHub.

Answered By: olekb