Python: How to handle a corrupted gzip file when reading multiple files

Question:

I am reading a large set of gzip files. When I run the code below, the process cannot finish because some of the files are corrupted. Python can open those corrupted files, but the process is interrupted by errors raised while reading certain lines in them.

    for file in files:
        try:
            fin=gzip.open(file,'rb')
        except:
            continue
        
        for line in fin:
            try:
                temp=line.decode().split(",")
                a,b,c,d=temp[0],int(temp[1]),int(temp[2]),int(temp[3])
            except:
                continue

But the program stops because of the following error.
What is the best way to process a corrupted gzip file?

Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/gzip.py", line 374, in readline
    return self._buffer.readline(size)
  File "/opt/anaconda3/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/opt/anaconda3/lib/python3.7/gzip.py", line 463, in read
    if not self._read_gzip_header():
  File "/opt/anaconda3/lib/python3.7/gzip.py", line 411, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'rv')

I have modified the code as below, and it seems to run well, but I am not sure whether this is the best way to handle such cases, because in certain cases the program does not seem to terminate (I need to test more).

    for file in files:
        try:
            fin=gzip.open(file,'rb')
        except:
            continue
        
        line=True
        while line:
            try:
                line=fin.readline()
            except:
                continue
            try:
                temp=line.decode().split(",")
                a,b,c,d=temp[0],int(temp[1]),int(temp[2]),int(temp[3])
            except:
                continue
Asked By: notilas


Answers:

I have split the file processing into a separate function, so that exceptions raised while processing each file are handled inside it.

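    import gzip

    def proc_file(file):
        try:
            fin=gzip.open(file,'rb')
        except OSError:
            return None

        # Assumed result: the list of parsed (a, b, c, d) rows.
        processed_value=[]
        err_cnt=0

        # Give up on this file after 10 read errors.
        while err_cnt<10:
            try:
                line=fin.readline()
            except Exception:
                err_cnt+=1
                continue
            if not line:
                # Empty bytes means end of file.
                break
            try:
                temp=line.decode().split(",")
                processed_value.append((temp[0],int(temp[1]),int(temp[2]),int(temp[3])))
            except (ValueError, IndexError):
                # Skip lines that do not parse.
                continue
        fin.close()
        return processed_value

    # Created once, so results from earlier files are kept.
    result=[]
    for file in files:
        value=proc_file(file)
        if value is not None:
            result.append(value)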
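
Instead of counting errors, another option is to stop reading a file at the first read error and keep whatever rows were parsed before it. The following is only a sketch under that assumption, using the same four-column layout as the question; `read_rows` is a name introduced here for illustration, not part of the code above.

```python
import gzip
import zlib

def read_rows(path):
    """Return (rows, ok): the rows parsed before any read error, and
    whether the file was read to the end without one."""
    rows = []
    try:
        with gzip.open(path, 'rb') as fin:
            for line in fin:
                try:
                    temp = line.decode().split(",")
                    rows.append((temp[0], int(temp[1]), int(temp[2]), int(temp[3])))
                except (ValueError, IndexError):
                    # Malformed line: skip it and keep reading the file.
                    continue
    except (OSError, EOFError, zlib.error):
        # The compressed stream itself is damaged or missing, so stop here.
        return rows, False
    return rows, True
```

Called as `rows, ok = read_rows(file)` inside the loop over `files`, the flag tells the caller whether a file was read to the end or cut short.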

Answered By: notilas

Your iteration over the file is outside of any try…except, so an exception raised there will terminate the program. If you put a single try…except around the whole thing, it should work:

    for file in files:
        try:
            with gzip.open(file,'rb') as fin:
                for line in fin:
                    temp = line.decode().split(",")
                    a,b,c,d = temp[0], int(temp[1]), int(temp[2]), int(temp[3])
        except (OSError, ValueError):
            continue

Note also:

  • Catch only the specific exceptions that you would expect from a bad file, not other things that should still terminate the program (e.g. KeyboardInterrupt). A bare except: is usually a bad idea.
  • It is better to use a with construct with gzip.open, so that the file is closed even when an exception is raised.
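
The tuple above covers the error in the question, where the bad header surfaces as an OSError on the first read rather than in gzip.open. Python's gzip documentation notes that EOFError and zlib.error can also be raised for invalid gzip files, and a line with fewer than four fields raises IndexError, which is not a ValueError. A sketch with a wider tuple, assuming it is useful to record which files were skipped; `parse_files` and `BAD_FILE_ERRORS` are names made up for this example:

```python
import gzip
import zlib

# Errors treated here as "this file is bad, skip it".
BAD_FILE_ERRORS = (OSError, EOFError, zlib.error, ValueError, IndexError)

def parse_files(paths):
    """Parse every readable file; return (rows, skipped)."""
    rows, skipped = [], []
    for path in paths:
        file_rows = []
        try:
            with gzip.open(path, 'rb') as fin:
                for line in fin:
                    temp = line.decode().split(",")
                    file_rows.append((temp[0], int(temp[1]), int(temp[2]), int(temp[3])))
        except BAD_FILE_ERRORS as exc:
            # Drop the whole file, as in the loop above, and remember why.
            skipped.append((path, type(exc).__name__))
            continue
        rows.extend(file_rows)
    return rows, skipped
```

Keeping each file's rows in `file_rows` until the file has been read completely means a file that fails halfway contributes nothing to `rows`; whether that or keeping the partial rows is preferable depends on the data.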
Answered By: alani