Can pandas read an archive within an archive?

Question:

I have an archive file (archive.tar.gz) which contains multiple gzip-compressed files (e.g. file.txt.gz).

If I first extract the .txt.gz files to a folder, I can then open them with pandas directly using:

import pandas as pd

df = pd.read_csv('file.txt.gz', sep='\t', encoding='utf-8')

But if I explore the archive using the tarfile library, then it doesn’t work:

import pandas as pd
import tarfile

tar = tarfile.open("archive.tar.gz", "r:*")
csv_path = tar.getnames()[1]
df = pd.read_csv(tar.extractfile(csv_path), sep='\t', encoding='utf-8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Is it possible to do this?

Asked By: yoann


Answers:

read_csv is probably trying to interpret the input as a filename. If you wrap the extracted file's contents in io.BytesIO, I suspect you should be able to get read_csv to treat it as it would an open file handle.

from io import BytesIO
df = pd.read_csv(BytesIO(tar.extractfile(csv_path).read()), ...)
Answered By: Randy

When you pass a filename, pandas can infer from the *.gz extension that the file is compressed with gzip.

When you pass it a file object, you need to tell it explicitly about the compression so that it can decompress it as it reads the file.

This should work:

df = pd.read_csv(
    tar.extractfile(csv_path),
    compression='gzip',
    sep='\t',
    encoding='utf-8')

For more details, see the entry about the “compression” argument in the documentation for read_csv().
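
To try this without the original data, here is a minimal, self-contained sketch. It builds a small tar.gz in memory containing one gzip-compressed, tab-separated member, then reads that member with compression='gzip'. The helper names build_demo_archive and read_inner_tsv and the sample rows are made up for illustration and are not part of the question.

```python
import gzip
import io
import tarfile

import pandas as pd


def build_demo_archive() -> bytes:
    """Return the bytes of a tar.gz holding one gzip-compressed TSV member."""
    inner = gzip.compress(b"a\tb\n1\t2\n3\t4\n")  # hypothetical sample data
    outer = io.BytesIO()
    with tarfile.open(fileobj=outer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="file.txt.gz")
        info.size = len(inner)
        tar.addfile(info, io.BytesIO(inner))
    return outer.getvalue()


def read_inner_tsv(archive_bytes: bytes, member: str) -> pd.DataFrame:
    """Read a gzip-compressed TSV member straight out of a tar archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        handle = tar.extractfile(member)
        # A file object has no name for pandas to inspect, so the inner
        # gzip layer has to be named explicitly.
        return pd.read_csv(handle, compression="gzip", sep="\t", encoding="utf-8")


df = read_inner_tsv(build_demo_archive(), "file.txt.gz")
print(df)
```

The outer gzip layer is handled by tarfile; only the inner layer needs the compression argument.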

Answered By: filbranden

A bit late but I had the same requirement and the following solution works. Two small changes – you have to read the extracted file tar.extractfile(xx).read() and pass it to BytesIO():

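import tarfile
from io import BytesIO

import pandas as pd

tar = tarfile.open("archive.tar.gz", "r:gz")
csv_path = tar.getnames()[1]
# Read the member fully into memory and wrap the bytes in a seekable buffer.
csv_bytes = BytesIO(tar.extractfile(csv_path).read())
# If the member is itself gzip-compressed, also pass compression='gzip'.
df = pd.read_csv(csv_bytes, sep='\t', encoding='utf-8')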
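
As a rough sketch of why the error in the question mentions byte 0x8b: every gzip stream starts with the two bytes 0x1f 0x8b, so a still-compressed member fails UTF-8 decoding at position 1. The hypothetical helper below (member_bytes is an illustrative name, not a library function) reads a member and gunzips it only when it starts with those bytes. It uses just the standard library, with a small in-memory archive standing in for the real one.

```python
import gzip
import io
import tarfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def member_bytes(tar: tarfile.TarFile, name: str) -> bytes:
    """Return a member's contents, gunzipping them if they look like gzip."""
    handle = tar.extractfile(name)
    if handle is None:  # directories and other non-file members
        raise ValueError(f"{name!r} is not a regular file")
    raw = handle.read()
    return gzip.decompress(raw) if raw.startswith(GZIP_MAGIC) else raw


# Build a demo archive with one plain and one gzip-compressed member.
payload = b"a\tb\n1\t2\n"  # hypothetical sample data
inner = gzip.compress(payload)
outer = io.BytesIO()
with tarfile.open(fileobj=outer, mode="w:gz") as tar:
    for name, data in (("plain.txt", payload), ("file.txt.gz", inner)):
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

outer.seek(0)
with tarfile.open(fileobj=outer, mode="r:gz") as tar:
    texts = {name: member_bytes(tar, name).decode("utf-8") for name in tar.getnames()}
```

The decompressed bytes can then be wrapped in io.BytesIO and passed to pd.read_csv without a compression argument.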
Answered By: Florian