Minimize the size of a file while saving a pandas dataframe

Question:

I want to write a pandas dataframe to a file. I have about 200MB of csv data. Which file extension should I write to such that the file size is the minimum?

I am open to a binary format as well, since I will only be working with the data as a dataframe.

UPDATE: In my case, the compressed zip format gave the smallest file, while the pickle format (.pkl) was the fastest to read and write. I have not tried Parquet or Feather because of the additional dependencies they require.

Asked By: optimistic-orange

||

Answers:

Using only the standard pandas library, a binary pickle is the way to go. For more detail, you might find the following video useful:

https://www.youtube.com/watch?v=u4rsA5ZiTls&t=150s
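As a rough illustration of the pickle route, the sketch below writes a small frame both as a plain pickle and as an xz-compressed pickle, then reads it back. The sample data and file names are made up for the example; pandas infers the compression from the file suffix.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample frame, not the asker's data.
df = pd.DataFrame({"id": range(1000), "label": ["alpha", "beta"] * 500})

with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "df.pkl")
    packed = os.path.join(tmp, "df.pkl.xz")

    df.to_pickle(plain)   # uncompressed pickle
    df.to_pickle(packed)  # compression inferred from the ".xz" suffix

    plain_size = os.path.getsize(plain)
    packed_size = os.path.getsize(packed)
    restored = pd.read_pickle(packed)  # decompression is inferred as well

print(plain_size, packed_size)
print(restored.equals(df))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.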

Answered By: Yutaro Watanabe

Writing to a Parquet file may be a good option. It requires either the pyarrow or the fastparquet library. See the pandas documentation for DataFrame.to_parquet.


    import pandas as pd

    df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
    # Write with gzip compression, then read the file back
    df.to_parquet('df.parquet.gzip',
                  compression='gzip')
    pd.read_parquet('df.parquet.gzip')

Parquet stores data column by column, so values of the same type sit together and often compress well.
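To see how much the codec matters for your own data, something like the following sketch could be used. The helper name parquet_sizes and the sample frame are hypothetical; the helper returns None when neither pyarrow nor fastparquet is installed, because pandas raises ImportError in that case.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample frame, not the asker's data.
df = pd.DataFrame({"col1": range(10_000), "col2": ["alpha", "beta"] * 5_000})


def parquet_sizes(frame, codecs=("snappy", "gzip", None)):
    """Return {codec: bytes on disk}, or None if no Parquet engine is installed."""
    sizes = {}
    with tempfile.TemporaryDirectory() as tmp:
        for codec in codecs:
            path = os.path.join(tmp, f"df_{codec}.parquet")
            try:
                frame.to_parquet(path, compression=codec)
            except ImportError:
                # Neither pyarrow nor fastparquet could be imported.
                return None
            sizes[codec] = os.path.getsize(path)
    return sizes


sizes = parquet_sizes(df)
print(sizes)
```

Passing compression=None writes the file uncompressed, which gives a baseline for the other codecs.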

Answered By: Dionisauce

You can simply compress your csv, using .zip extension instead of .csv:

# A zip archive with only one file
df.to_csv('export.zip')

# Or to get more control
df.to_csv('export.zip', compression={'method': 'zip', 'compresslevel': 9})

# You can read the file with
df = pd.read_csv('export.zip')
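If you want to confirm what ends up inside the archive, the standard zipfile module can open it. This is a minimal sketch with made-up data; it passes index=False so that the row index is not written as an extra column.

```python
import os
import tempfile
import zipfile

import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})  # hypothetical data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "export.zip")
    df.to_csv(path, index=False)  # zip compression inferred from the suffix

    with zipfile.ZipFile(path) as archive:
        members = archive.namelist()  # names of the files in the archive
        csv_text = archive.read(members[0]).decode("utf-8")

print(members)
print(csv_text)
```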
Answered By: Corralien

If you are saving your data to CSV files, pandas already has a built-in compression keyword (see the to_csv documentation).

You can use it like this:

df.to_csv("my_data.csv.zip", compression="zip")
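The explicit keyword matters most when the file name does not reveal the compression, because pandas then cannot infer it. A small sketch, assuming a hypothetical file name my_data.dat and made-up data:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})  # hypothetical data

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name whose suffix says nothing about compression.
    path = os.path.join(tmp, "my_data.dat")

    # Name the method explicitly on both the write and the read.
    df.to_csv(path, index=False, compression="gzip")
    restored = pd.read_csv(path, compression="gzip")

print(restored.equals(df))
```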
Answered By: Nullman

I created a test data frame in a pseudo-panel-like format. The compression you achieve will always depend on your data: if your data are literally the same thing repeated over and over again, compression ratios will be high; if your data never repeat, compression ratios will be low.

To get answers for your data, take a sample of your data with df.sample(10_000) (or something like that) and execute code like mine below which saves it in different formats. Then compare the sizes.

import random
from os.path import getsize

import pandas as pd

df = pd.DataFrame({
    'd': range(0, 10_000),
    's': [random.choice(['alpha', 'beta', 'gamma', 'delta'])
          for _ in range(0, 10_000)],
    'i': [random.randint(0, 1000) for _ in range(0, 10_000)]
})

I then measured the file size of each of the following save formats.

l = []
for p in ['.csv', '.csv.gz', '.csv.xz', '.csv.bz2', '.csv.zip']:
    df.to_csv('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})

for p in ['.pkl', '.pkl.gz', '.pkl.xz', '.pkl.bz2']:
    df.to_pickle('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})

# '.xlsx' needs openpyxl; '.xls' needs xlwt, which pandas 2.0 no longer supports
for p in ['.xls', '.xlsx']:
    df.to_excel('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})
    
for p in ['.dta', '.dta.gz', '.dta.xz', '.dta.bz2']:
    df.to_stata('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})

cr = pd.DataFrame(l)
cr['ratio'] = cr['size'] / cr.loc[0, 'size']
cr.sort_values('ratio', inplace=True)

That yielded the following table:

            name    size     ratio
7    temp.pkl.xz   22532  0.110395
8   temp.pkl.bz2   23752  0.116372
13   temp.dta.xz   39276  0.192431
6    temp.pkl.gz   40619  0.199011
2    temp.csv.xz   42332  0.207404
14  temp.dta.bz2   51694  0.253273
3   temp.csv.bz2   54801  0.268495
12   temp.dta.gz   57513  0.281783
1    temp.csv.gz   70219  0.344035
4   temp.csv.zip   70837  0.347063
11      temp.dta  170912  0.837377
5       temp.pkl  180865  0.886141
0       temp.csv  204104  1.000000
10     temp.xlsx  216828  1.062341
9       temp.xls  711168  3.484341
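To repeat this kind of comparison on a sample of your own data without leaving files behind, the loop can be wrapped in a helper that writes into a temporary directory. This is only a sketch: compare_sizes is a hypothetical name, the suffix list is a shortened subset of the formats above, and the frame is random test data like mine.

```python
import os
import random
import tempfile

import pandas as pd


def compare_sizes(frame, suffixes=(".csv", ".csv.gz", ".csv.xz", ".pkl", ".pkl.xz")):
    """Write `frame` under each suffix in a temp dir; return sizes, smallest first."""
    rows = []
    with tempfile.TemporaryDirectory() as tmp:
        for suffix in suffixes:
            path = os.path.join(tmp, "temp" + suffix)
            if ".pkl" in suffix:
                frame.to_pickle(path)  # compression inferred from the suffix
            else:
                frame.to_csv(path)
            rows.append({"name": "temp" + suffix, "size": os.path.getsize(path)})
    report = pd.DataFrame(rows)
    # Ratio relative to the first suffix (plain CSV by default).
    report["ratio"] = report["size"] / report.loc[0, "size"]
    return report.sort_values("ratio").reset_index(drop=True)


random.seed(0)
df = pd.DataFrame({
    "d": range(10_000),
    "s": [random.choice(["alpha", "beta", "gamma", "delta"]) for _ in range(10_000)],
    "i": [random.randint(0, 1000) for _ in range(10_000)],
})

report = compare_sizes(df.sample(1_000, random_state=0))
print(report)
```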

I did not try to_parquet or to_feather because they require the pyarrow dependency, which is not standard in Anaconda.

Exporting to Excel 2003’s format (.xls) threw a warning that xlwt is no longer maintained and will be removed. Given how large the resulting file is, that is no great loss.

Answered By: ifly6