Read .tar.gz file in Python

Question:

I have a text file of 25GB. so i compressed it to tar.gz and it became 450 MB. now i want to read that file from python and process the text data.for this i referred question . but in my case code doesn’t work. the code is as follows :

import tarfile
import numpy as np 

tar = tarfile.open("filename.tar.gz", "r:gz")
for member in tar.getmembers():
     f=tar.extractfile(member)
     content = f.read()
     Data = np.loadtxt(content)

the error is as follows :

Traceback (most recent call last):
  File "dataExtPlot.py", line 21, in <module>
    content = f.read()
AttributeError: 'NoneType' object has no attribute 'read'

also, Is there any other method to do this task ?

Asked By: KrunalParmar

||

Answers:

The docs tell us that None is returned by extractfile() if the member is a not a regular file or link.

One possible solution is to skip over the None results:

tar = tarfile.open("filename.tar.gz", "r:gz")
for member in tar.getmembers():
     f = tar.extractfile(member)
     if f is not None:
         content = f.read()
Answered By: Raymond Hettinger

tarfile.extractfile() can return None if the member is neither a file nor a link. For example your tar archive might contain directories or device files. To fix:

import tarfile
import numpy as np 

tar = tarfile.open("filename.tar.gz", "r:gz")
for member in tar.getmembers():
     f = tar.extractfile(member)
     if f:
         content = f.read()
         Data = np.loadtxt(content)
Answered By: mhawke

You may try this one

t = tarfile.open("filename.gz", "r")
for filename in t.getnames():
    try:
        f = t.extractfile(filename)
        Data = f.read()
        print filename, ':', Data
    except :
        print 'ERROR: Did not find %s in tar archive' % filename
Answered By: VICTOR

You cannot “read” the content of some special files such as links yet tar supports them and tarfile will extract them alright. When tarfile extracts them, it does not return a file-like object but None. And you get an error because your tarball contains such a special file.

One approach is to determine the type of an entry in a tarball you are processing ahead of extracting it: with this information at hand you can decide whether or not you can “read” the file. You can achieve this by calling tarfile.getmembers() returns tarfile.TarInfos that contain detailed information about the type of file contained in the tarball.

The tarfile.TarInfo class has all the attributes and methods you need to determine the type of tar member such as isfile() or isdir() or tinfo.islnk() or tinfo.issym() and then accordingly decide what do to with each member (extract or not, etc).

For instance I use these to test the type of file in this patched tarfile to skip extracting special files and process links in a special way:

for tinfo in tar.getmembers():
    is_special = not (tinfo.isfile() or tinfo.isdir()
                      or tinfo.islnk() or tinfo.issym())
...
Answered By: Philippe Ombredanne

In Jupyter notebook you can do like below

!wget -c http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz -O - | tar -xz
Answered By: Jadli

My needs:

  1. Python3.
  2. My tar.gz file consists of multiple utf-8 text files and dir.
  3. Need to read text lines from all files.

Problems:

  1. The tar object returned by tar.getmembers() maybe None.
  2. The content extractfile(fname) returns is a bytes str (e.g. b’Hellotxe4xbdxa0xe5xa5xbd’). Unicode char doesn’t display correctly.

Solutions:

  1. Check the type of tar object first. I reference the example in doc of tarfile lib. (Search “How to read a gzip compressed tar archive and display some member information”)
  2. Decode from byte str to normal str. (ref – most voted answer)

Code:

with tarfile.open("sample.tar.gz", "r:gz") as tar:
for tarinfo in tar:
    logger.info(f"{tarinfo.name} is {tarinfo.size} bytes in size and is: ")
    if tarinfo.isreg():
        logger.info(f"Is regular file: {tarinfo.name}")
        f = tar.extractfile(tarinfo.name)  
        # To get the str instead of bytes str
        # Decode with proper coding, e.g. utf-8
        content = f.read().decode('utf-8', errors='ignore')
        # Split the long str into lines
        # Specify your line-sep: e.g. n
        lines = content.split('n')
        for i, line in enumerate(lines):
            print(f"[{i}]: {line}n")
    elif tarinfo.isdir():
        logger.info(f"Is dir: {tarinfo.name}")
    else:
        logger.info(f"Is something else: {tarinfo.name}.")
Answered By: MonkandMonkey
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.