python: read lines from compressed text files

Question:

Is it easy to read a line from a gz-compressed text file using Python without extracting the file completely? I have a text.gz file which is around 200 MB. When I extract it, it becomes 7.4 GB. And this is not the only file I have to read: for the whole process, I have to read 10 files. Although this will be a sequential job, I think it would be smart to do it without extracting all the data. I do not even know whether that is possible. How can it be done using Python? I need to read a text file line by line.

Answers:

Have you tried using gzip.GzipFile? Arguments are similar to open.

Answered By: jrennie

You could use the standard gzip module in python. Just use:

gzip.open('myfile.gz')

to open the file as any other file and read its lines.

More information here: Python gzip module
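One detail worth knowing: called with only a filename, gzip.open uses binary mode ('rb'), so each line comes back as bytes. Passing 'rt' gives str lines instead. The following is a small sketch of the difference; the sample file, its contents and the UTF-8 encoding are assumptions made for illustration.

```python
import gzip
import os
import tempfile

# Build a tiny gzip file so the example is self-contained (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
with gzip.open(path, "wb") as f:
    f.write(b"first line\nsecond line\n")

# Default mode is binary ('rb'): iterating yields bytes objects.
with gzip.open(path) as f:
    binary_lines = list(f)

# Text mode ('rt') decodes to str; UTF-8 is an assumption about the file.
with gzip.open(path, "rt", encoding="utf-8") as f:
    text_lines = list(f)

print(binary_lines)
print(text_lines)
```

In both cases the file is decompressed as it is read, so the full uncompressed content is never written to disk.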

Answered By: smichak

Using gzip.open in text mode:

import gzip

with gzip.open('input.gz','rt') as f:
    for line in f:
        print('got line', line)

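Note: in binary mode, gzip.open(filename, mode) returns a gzip.GzipFile(filename, mode) object; in text mode ('rt') it additionally wraps that object in an io.TextIOWrapper, so iterating yields str lines.
I prefer gzip.open, as it looks similar to the with open(...) as f: used for opening uncompressed files.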
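Since the question mentions reading about ten such files one after another, the same loop can be wrapped in a generator. This is only a sketch: the function name iter_gz_lines and the UTF-8 default are assumptions, not part of the gzip module.

```python
import gzip

def iter_gz_lines(paths, encoding="utf-8"):
    """Yield (path, line) pairs from each gzip file in turn.

    Hypothetical helper: only one file is open at a time, and each file
    is decompressed incrementally as its lines are consumed.
    """
    for path in paths:
        with gzip.open(path, "rt", encoding=encoding) as f:
            for line in f:
                # Strip the trailing newline so callers get the bare text.
                yield path, line.rstrip("\n")
```

It could be used as for path, line in iter_gz_lines(list_of_files): ..., where list_of_files holds the paths to process.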

Answered By: fferri

The gzip library (obviously) uses gzip, which can be a bit slow. You can speed things up with a system call to pigz, the parallelized version of gzip. The downsides are that you have to install pigz and that it will use more cores during the run, but it is much faster and not more memory intensive. The call to open the file then becomes os.popen('pigz -dc ' + filename) instead of gzip.open(filename,'rt'). The pigz flags are -d for decompress and -c for writing to stdout, which os.popen can then read.

The following code takes a file and a number (1 or 2) and counts the number of lines in the file using the corresponding call, while measuring the time taken. Save it as unzip-file.py:

#!/usr/bin/python
import os
import sys
import time
import gzip

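def local_unzip(obj):
    # obj is an already-open, iterable file object: count its lines and time the pass.
    t0 = time.time()
    count = 0
    with obj as f:
        for line in f:
            count += 1
    # Print the elapsed seconds and the number of lines read.
    print(time.time() - t0, count)

r = sys.argv[1]  # path to the .gz file
if sys.argv[2] == "1":
    # Option 1: Python's gzip module in text mode.
    local_unzip(gzip.open(r,'rt'))
else:
    # Option 2: read the output of pigz -dc through a pipe.
    local_unzip(os.popen('pigz -dc ' + r))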

Running the script under /usr/bin/time -f %M, which reports the maximum resident set size of the process in kilobytes, on a 28G file gives:

$ /usr/bin/time -f %M ./unzip-file.py $file 1
(3037.2604110240936, 1223422024)
5116

$ /usr/bin/time -f %M ./unzip-file.py $file 2
(598.771901845932, 1223422024)
4996

This shows that the system call is about five times faster (10 minutes compared to 50 minutes) while using basically the same maximum memory. It is also worth noting that, depending on what you are doing per line, reading the file might not be the limiting factor, in which case the option you take does not matter.
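As a variation on the os.popen call above, here is a minimal sketch that starts pigz through subprocess with an argument list, so the filename is not passed through a shell, and falls back to the gzip module when pigz is not installed. The function name iter_lines_pigz is hypothetical, text decoding uses the platform default in both branches, and the sketch has not been benchmarked against the timings above.

```python
import gzip
import shutil
import subprocess

def iter_lines_pigz(path):
    """Yield text lines from a gzip file, via pigz if available, else gzip."""
    pigz = shutil.which("pigz")  # None when pigz is not on PATH
    if pigz is None:
        # Fallback: streaming decompression with the standard gzip module.
        with gzip.open(path, "rt") as f:
            yield from f
        return
    # Argument list, no shell: the path reaches pigz as a single argument.
    with subprocess.Popen([pigz, "-dc", path],
                          stdout=subprocess.PIPE, text=True) as proc:
        yield from proc.stdout
    # Reached only when the stream was read to the end.
    if proc.returncode != 0:
        raise RuntimeError("pigz exited with status %d" % proc.returncode)
```

Counting lines as in the script above would then be sum(1 for _ in iter_lines_pigz(filename)).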

Answered By: Cão