Reading multiple files contained in a zip file with pandas
Question:
I have multiple zip files containing different types of txt files.
Like below:
zip1
- file1.txt
- file2.txt
- file3.txt
How can I use pandas to read in each of those files without extracting them?
I know that if there were one file per zip I could use the compression argument of read_csv, like below:
df = pd.read_csv('textfile.zip', compression='zip')
Any help on how to do this would be great.
Answers:
I had a similar problem with XML files a while ago. The zipfile module can get you there.
from zipfile import ZipFile
# yourfile is the path to the archive
with ZipFile(yourfile) as z:
    for text_file in z.infolist():
        # read() returns the member's contents as bytes
        data = z.read(text_file.filename)
If you want to concatenate them into a pandas object it gets a bit more complex, but that should get you started. Note that the read method returns bytes, so you may have to decode them or wrap them in a file-like object first.
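One way to deal with the bytes that read returns is to wrap them in io.BytesIO so pandas can treat them as a file. This is only a minimal sketch: the helper name read_member is hypothetical, and it assumes the member is CSV-like text that read_csv can parse with whatever keyword arguments you pass through.

```python
import io
from zipfile import ZipFile

import pandas as pd


def read_member(zip_path, member_name, **read_csv_kwargs):
    """Read one member of a zip archive into a DataFrame (hypothetical helper)."""
    with ZipFile(zip_path) as z:
        raw = z.read(member_name)  # bytes, not str
    # BytesIO gives pandas a file-like object; pandas handles the decoding
    return pd.read_csv(io.BytesIO(raw), **read_csv_kwargs)
```

Something like read_member('textfile.zip', 'file1.txt', sep='\t') would then return a DataFrame for a tab-separated member; the separator here is just an example.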
You can pass the file object returned by ZipFile.open() to pandas.read_csv() to construct a pandas.DataFrame from a CSV file packed into a multi-file zip.
Code:
pd.read_csv(zip_file.open('file3.txt'))
Example to read all .csv files into a dict:
import pandas as pd
from zipfile import ZipFile
zip_file = ZipFile('textfile.zip')
# map each member name to its DataFrame
dfs = {text_file.filename: pd.read_csv(zip_file.open(text_file.filename))
       for text_file in zip_file.infolist()
       if text_file.filename.endswith('.csv')}
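Since the question mentions several zip files, the same idea can be stretched across archives. The sketch below is an assumption about how one might organize that, not something from the answers: read_archives is a hypothetical helper, and the '.txt' suffix default is chosen to match the file names in the question.

```python
from zipfile import ZipFile

import pandas as pd


def read_archives(zip_paths, suffix='.txt'):
    """Map (archive path, member name) -> DataFrame for every matching member."""
    dfs = {}
    for zip_path in zip_paths:
        with ZipFile(zip_path) as z:
            for name in z.namelist():
                if name.endswith(suffix):
                    # open() yields a binary file object that read_csv accepts
                    with z.open(name) as f:
                        dfs[(zip_path, name)] = pd.read_csv(f)
    return dfs
```

Keying on the (archive, member) pair avoids collisions when two archives contain a member with the same name.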
The simplest way to handle this, if you have multiple parts of one big CSV file compressed into a single zip file:
import pandas as pd
from zipfile import ZipFile
# open the archive once instead of once per member
with ZipFile('some.zip') as z:
    df = pd.concat(
        [pd.read_csv(z.open(name)) for name in z.namelist()],
        ignore_index=True
    )
For those who have empty txt files in the zipfile:
import pandas as pd
from zipfile import ZipFile
z = ZipFile('textfile.zip')
# file_size is the uncompressed size; an empty member can still have a nonzero compress_size
df = pd.concat(
    [pd.read_csv(z.open(i.filename)) for i in z.infolist() if i.file_size > 0],
    ignore_index=True)
Otherwise pandas raises "pandas.errors.EmptyDataError: No columns to parse from file".
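Instead of filtering on size up front, one could also catch the exception and skip the offending member. This is a rough sketch under the assumption that every non-empty member shares the same columns; read_non_empty is a hypothetical name.

```python
from zipfile import ZipFile

import pandas as pd
from pandas.errors import EmptyDataError


def read_non_empty(zip_path):
    """Concatenate all parseable members, skipping empty ones (hypothetical helper)."""
    frames = []
    with ZipFile(zip_path) as z:
        for name in z.namelist():
            if name.endswith('/'):
                continue  # directory entry, nothing to parse
            with z.open(name) as f:
                try:
                    frames.append(pd.read_csv(f))
                except EmptyDataError:
                    continue  # member had no columns to parse
    # note: pd.concat raises ValueError if every member was skipped
    return pd.concat(frames, ignore_index=True)
```

This also covers members that are not literally zero bytes but still contain nothing for read_csv to parse.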