Counting the number of lines in a gzip file using Python

Question:

I’m trying to count the number of lines in a gz archive. Each archive contains a single JSON-formatted text file. But when I open the archive and count the lines, the count is far from what I’d expect: the file contains 522 lines, but my code returns 668480.

import gzip
f = gzip.open(myfile, 'rb')
file_content = f.read()
for i, l in enumerate(file_content):
    pass
i += 1
print("File {1} contain {0} lines".format(i, myfile))
Asked By: John

Answers:

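You are iterating over every byte of the decompressed content, not over its lines: f.read() returns a single bytes object, and looping over that yields one item per byte. Iterate over the file object itself to get lines, in the following way: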

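import gzip

i = 0  # stays 0 if the file is empty
with gzip.open(myfile, 'rb') as f:
    # Iterating over the file object yields one line at a time.
    for i, l in enumerate(f, start=1):
        pass
print("File {1} contain {0} lines".format(i, myfile))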
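
To see the difference concretely, here is a minimal self-contained sketch. The in-memory sample data and the helper names count_bytes and count_lines are made up for illustration and do not come from the question.

```python
import gzip
import io

# Hypothetical sample data: three short JSON lines, compressed in memory.
raw = b'{"a": 1}\n{"a": 2}\n{"a": 3}\n'
compressed = gzip.compress(raw)

def count_bytes(data):
    """Mimic the question's loop: iterate over the result of read()."""
    with gzip.open(io.BytesIO(data), 'rb') as f:
        return sum(1 for _ in f.read())  # one item per byte

def count_lines(data):
    """Iterate over the file object itself, which yields whole lines."""
    with gzip.open(io.BytesIO(data), 'rb') as f:
        return sum(1 for _ in f)

print(count_bytes(compressed))  # equals len(raw), the decompressed size
print(count_lines(compressed))  # the actual number of lines
```

The first number is the decompressed size in bytes, which is probably what the 668480 in the question represents.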
Answered By: Dmitry Kovriga

For a performant way to count the lines in a gzip file you can use the pragzip package:

import pragzip

result = 0
with pragzip.open(myfile) as file:
    while chunk := file.read(1024 * 1024):
        result += chunk.count(b'\n')
print(f"Number of lines: {result}")

Comparing the timing of the above with @DmitryKovriga’s answer:

Number of lines: 33468793
Elapsed time is 22.373915 seconds.

File datasets/binance-futures_incremental_book_L2_2020-07-01_BTCUSDT.csv.gz contain 33468793 lines
Elapsed time is 31.278056 seconds.

A speedup closer to 10x should be possible with a suitable setup. See https://unix.stackexchange.com/a/713093/163459 for more info.
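
If installing a third-party package is not an option, the same chunked newline-counting idea can be sketched with the standard gzip module. This is only an illustration, not part of the timings above, and no speed claim is made for it. The names count_lines_chunked and chunk_size are hypothetical.

```python
import gzip

def count_lines_chunked(path, chunk_size=1024 * 1024):
    """Count lines by counting newline bytes in fixed-size chunks."""
    newlines = 0
    last_byte = b''
    with gzip.open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            newlines += chunk.count(b'\n')
            last_byte = chunk[-1:]  # remember how the data ends
    # A final line without a trailing newline still counts as a line.
    if last_byte and last_byte != b'\n':
        newlines += 1
    return newlines
```

Unlike a bare newline count, this sketch also counts a last line that does not end in a newline, so the result matches what line-by-line iteration would report.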

Answered By: James Hirschorn