Python: Read compressed (.gz) HDF file without writing and saving uncompressed file

Question:

I have a large number of compressed HDF files, which I need to read.

file1.HDF.gz
file2.HDF.gz
file3.HDF.gz
...

I can read uncompressed HDF files with the following method:

from pyhdf.SD import SD, SDC
import os

# Decompress to disk with gunzip, then open the uncompressed copy
os.system('gunzip < file1.HDF.gz > file1.HDF')
HDF = SD('file1.HDF')

and repeat this for each file. However, this is more time-consuming than I would like.

I suspect that most of the overhead comes from writing the decompressed data out to a new file, and that I could speed things up if I could simply feed the decompressed contents straight into the SD function in one step, without ever writing an uncompressed file to disk.

Am I correct in this thinking? And if so, is there a way to do what I want?

Asked By: hm8


Answers:

sascha is correct that HDF's own transparent compression is a better fit than gzipping the files. Nonetheless, if you can't control how the HDF files are stored, you're looking for the gzip module (docs); it can read the data out of these files.
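
For illustration, a minimal sketch of that idea, assuming each file fits in memory (note that, as the answer below explains, pyhdf's SD cannot consume these bytes directly, so this only helps if something else will parse them):

import gzip

# Decompress file1.HDF.gz entirely in memory; nothing is written to disk.
with gzip.open('file1.HDF.gz', 'rb') as f:
    hdf_bytes = f.read()  # the raw bytes of the uncompressed HDF file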

Answered By: chicocvenancio

According to the pyhdf package documentation, this is not possible.

__init__(self, path, mode=1)
  SD constructor. Initialize an SD interface on an HDF file,
  creating the file if necessary.

There is no way to instantiate an SD object from a file-like object. This is likely because pyhdf conforms to an external interface (NCSA HDF). The HDF format is also designed around massive files that are impractical to hold in memory all at once.

Unzipping each file to disk is likely your most performant option.

If you would like to stay in Python, use the gzip module (docs):

import gzip
import shutil
# Stream the decompressed bytes to disk in chunks, without loading the whole file
with gzip.open('file1.HDF.gz', 'rb') as f_in, open('file1.HDF', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)
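
Putting it together for the whole set of files, here is a minimal sketch (not from the original answer) that decompresses each archive to a temporary file, opens it with pyhdf in the default read-only mode, and deletes the uncompressed copy afterwards; the glob pattern and the helper name open_gz_hdf are assumptions for illustration:

import glob
import gzip
import os
import shutil
import tempfile

from pyhdf.SD import SD

def open_gz_hdf(gz_path):
    """Decompress gz_path to a temporary .HDF file; return (SD object, temp path)."""
    with gzip.open(gz_path, 'rb') as f_in:
        with tempfile.NamedTemporaryFile(suffix='.HDF', delete=False) as f_out:
            shutil.copyfileobj(f_in, f_out)
            tmp_path = f_out.name
    return SD(tmp_path), tmp_path  # caller removes tmp_path when done

for gz_path in glob.glob('*.HDF.gz'):  # pattern is an assumption about the file names
    hdf, tmp_path = open_gz_hdf(gz_path)
    try:
        print(gz_path, list(hdf.datasets().keys()))  # e.g. list the datasets in each file
    finally:
        hdf.end()            # close the HDF interface
        os.remove(tmp_path)  # delete the uncompressed copy

On Linux you can also point tempfile at a RAM-backed location such as /dev/shm via the dir argument of NamedTemporaryFile, which keeps the unavoidable write as cheap as possible.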
Answered By: Kevin McDonough