Minimize the size of a file while saving a pandas dataframe
Question:
I want to write a pandas dataframe to a file. I have about 200MB of csv data. Which file format should I write to so that the file size is as small as possible?
I am open to writing in binary as well, since I will only be using the dataframe for my own work.
UPDATE: In my case the compressed zip format worked best storage-wise. But runtime-wise, the pickle format (.pkl) was the fastest to save and read. I have not tried parquet and feather due to the additional dependencies they require.
Answers:
Using the standard pandas library, the pickle binary format is the way to go. For more detailed information, you might find the following video useful.
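A minimal sketch of the pickle round trip (assuming a small example dataframe; to_pickle and read_pickle infer compression from the file extension, so the pickle can be compressed in the same call):
import pandas as pd

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

# Plain pickle: fast to save and load, but uncompressed
df.to_pickle('df.pkl')

# Compression is inferred from the extension (here: xz)
df.to_pickle('df.pkl.xz')

# Read it back the same way
df = pd.read_pickle('df.pkl.xz')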
Writing to a parquet file may be a good option. It requires either the pyarrow or fastparquet library. See the documentation here.
import pandas as pd

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
df.to_parquet('df.parquet.gzip', compression='gzip')
pd.read_parquet('df.parquet.gzip')
Parquet files can achieve high compression rates.
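If pyarrow is installed, it is worth comparing codecs on your own data; a short sketch (the availability of codecs such as 'brotli' or 'zstd' depends on your pyarrow build):
import os
import pandas as pd

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

# Write the same dataframe with different Parquet codecs and compare sizes
for codec in ['snappy', 'gzip', 'brotli', 'zstd']:
    path = f'df.{codec}.parquet'
    df.to_parquet(path, compression=codec)
    print(codec, os.path.getsize(path))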
You can simply compress your CSV by using the .zip extension instead of .csv:
# A zip archive with only one file
df.to_csv('export.zip')
# Or to get more control
df.to_csv('export.zip', compression={'method': 'zip', 'compresslevel': 9})
# You can read the file with
df = pd.read_csv('export.zip')
If you are saving your data to CSV files, pandas already has a built-in compression keyword (see the docs); you can use it like this:
df.to_csv("my_data.csv.zip", compression="zip")
I created a test data frame with a pseudo-panel-like format. Obviously, how well your data compress will always depend on the data itself: if your data are literally the same thing repeated over and over, compression ratios will be high; if your data never repeat, compression ratios will be low.
To get answers for your own data, take a sample with df.sample(10_000) (or something like that), save it in different formats with code like mine below, and compare the sizes.
import random
import pandas as pd
from os.path import getsize

df = pd.DataFrame({
    'd': range(0, 10_000),
    's': [random.choice(['alpha', 'beta', 'gamma', 'delta'])
          for _ in range(0, 10_000)],
    'i': [random.randint(0, 1000) for _ in range(0, 10_000)]
})
I then measured the resulting file size for each of the following save formats.
l = []
for p in ['.csv', '.csv.gz', '.csv.xz', '.csv.bz2', '.csv.zip']:
    df.to_csv('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})
for p in ['.pkl', '.pkl.gz', '.pkl.xz', '.pkl.bz2']:
    df.to_pickle('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})
for p in ['.xls', '.xlsx']:
    df.to_excel('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})
for p in ['.dta', '.dta.gz', '.dta.xz', '.dta.bz2']:
    df.to_stata('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})

cr = pd.DataFrame(l)
cr['ratio'] = cr['size'] / cr.loc[0, 'size']
cr.sort_values('ratio', inplace=True)
That yielded the following table:
             name    size     ratio
7     temp.pkl.xz   22532  0.110395
8    temp.pkl.bz2   23752  0.116372
13    temp.dta.xz   39276  0.192431
6     temp.pkl.gz   40619  0.199011
2     temp.csv.xz   42332  0.207404
14   temp.dta.bz2   51694  0.253273
3    temp.csv.bz2   54801  0.268495
12    temp.dta.gz   57513  0.281783
1     temp.csv.gz   70219  0.344035
4    temp.csv.zip   70837  0.347063
11       temp.dta  170912  0.837377
5        temp.pkl  180865  0.886141
0        temp.csv  204104  1.000000
10      temp.xlsx  216828  1.062341
9        temp.xls  711168  3.484341
I did not try to_parquet or to_feather because they require the pyarrow dependency, which is not part of a standard Anaconda installation.
Exporting to Excel 2003's format (.xls) threw a warning that xlwt is no longer maintained and will be removed. Given how huge the resulting file is, that is no major loss.
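Since the question's update also asks about runtime, the same loop idea can be extended to time saves and reads. A minimal sketch using time.perf_counter (the format list here is illustrative, not exhaustive):
import time
import pandas as pd

def timed(fn, *args):
    # Wall-clock seconds for a single call
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

rows = []
for ext, save, load in [('.csv.zip', df.to_csv, pd.read_csv),
                        ('.pkl', df.to_pickle, pd.read_pickle),
                        ('.pkl.xz', df.to_pickle, pd.read_pickle)]:
    path = 'temp' + ext
    rows.append({'name': path,
                 'save_s': timed(save, path),
                 'read_s': timed(load, path)})
print(pd.DataFrame(rows).sort_values('read_s'))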