Can pandas read an archive within an archive?
Question:
I have an archive file (archive.tar.gz) which contains multiple archive files (file.txt.gz).
If I first extract the .txt.gz files to a folder, I can then open them with pandas directly using:
import pandas as pd
df = pd.read_csv('file.txt.gz', sep='\t', encoding='utf-8')
But if I explore the archive using the tarfile library, then it doesn’t work:
import pandas as pd
import tarfile
tar = tarfile.open("archive.tar.gz", "r:*")
csv_path = tar.getnames()[1]
df = pd.read_csv(tar.extractfile(csv_path), sep='\t', encoding='utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
Is it possible to read the inner file without extracting it to disk first?
Answers:
read_csv is probably trying to interpret the input as a filename. If you wrap the extracted file in io.BytesIO, I suspect you should be able to get it to treat it as it would an open file handle:
from io import BytesIO
df = pd.read_csv(BytesIO(tar.extractfile(csv_path)), ...)
When you open the file by filename, pandas can infer that it is compressed with gzip from the .gz extension on the filename.
When you pass it a file object, you need to tell it explicitly about the compression so that it can decompress it as it reads the file.
This should work:
df = pd.read_csv(
    tar.extractfile(csv_path),
    compression='gzip',
    sep='\t',
    encoding='utf-8')
For more details, see the entry about the “compression” argument in the documentation for read_csv().
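To make the point above concrete, here is a minimal, self-contained sketch. It builds a tiny archive in memory (the data and the helper names build_sample_archive and read_inner_tables are made up for illustration, they are not from the question) and then reads every inner .gz member by passing compression='gzip' explicitly. Treat it as one way to do it, assuming the inner files are gzipped, tab-separated text.

```python
import gzip
import io
import tarfile

import pandas as pd


def build_sample_archive() -> bytes:
    """Build a tiny tar.gz in memory holding one gzipped TSV (made-up data)."""
    inner = gzip.compress(b"a\tb\n1\t2\n3\t4\n")
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="file.txt.gz")
        info.size = len(inner)
        tar.addfile(info, io.BytesIO(inner))
    return buf.getvalue()


def read_inner_tables(archive_bytes: bytes) -> dict:
    """Return {member name: DataFrame} for every .gz file inside the tar."""
    frames = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".gz"):
                # A file object carries no filename for pandas to inspect,
                # so the compression has to be named explicitly.
                frames[member.name] = pd.read_csv(
                    tar.extractfile(member),
                    compression="gzip",
                    sep="\t",
                    encoding="utf-8",
                )
    return frames


frames = read_inner_tables(build_sample_archive())
```

Selecting members by name or extension, as done here, is usually safer than relying on a fixed position such as tar.getnames()[1], since the order of entries depends on how the archive was created.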
A bit late, but I had the same requirement and the following solution works. Two small changes: you have to read the extracted file with tar.extractfile(xx).read() and pass the result to BytesIO():
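from io import BytesIO
tar = tarfile.open("archive.tar.gz", "r:gz")
csv_path = tar.getnames()[1]
# Read the member's raw (still gzipped) bytes into an in-memory buffer
csv_bytes = BytesIO(tar.extractfile(csv_path).read())
# A buffer has no filename to infer from, so name the compression explicitly
df = pd.read_csv(csv_bytes, compression='gzip', sep='\t', encoding='utf-8')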
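As an aside, the 0x8b at position 1 in the error is the second byte of the gzip magic number (0x1f 0x8b), which is a hint that the bytes reaching the parser were still compressed. If you would rather not read the whole member into memory, a possible alternative is to let the standard gzip module decompress the member as a stream. The sketch below uses only the standard library and made-up sample data; the helper names are hypothetical. The decompressed text stream it opens is an ordinary text file handle, so it could be handed to pd.read_csv instead of csv.reader.

```python
import csv
import gzip
import io
import tarfile


def build_sample_archive() -> bytes:
    """Build a tiny tar.gz in memory holding one gzipped TSV (made-up data)."""
    inner = gzip.compress(b"a\tb\n1\t2\n3\t4\n")
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="file.txt.gz")
        info.size = len(inner)
        tar.addfile(info, io.BytesIO(inner))
    return buf.getvalue()


def read_inner_rows(archive_bytes: bytes, member_name: str) -> list:
    """Decompress one gzipped member as a text stream and parse its rows."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        raw = tar.extractfile(member_name)
        # Peek at the first two bytes, then rewind: gzip data starts 1f 8b.
        head = raw.read(2)
        raw.seek(0)
        if head != b"\x1f\x8b":
            raise ValueError(f"{member_name} does not look gzipped")
        # gzip.open accepts an existing binary file object, not only a path.
        with gzip.open(raw, mode="rt", encoding="utf-8", newline="") as text:
            return list(csv.reader(text, delimiter="\t"))


rows = read_inner_rows(build_sample_archive(), "file.txt.gz")
```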