Fastest file format for read/write operations with Pandas and/or Numpy

Question:

I’ve been working with very large DataFrames for a while, and I’ve been using the CSV format to store input data and results. I’ve noticed that a lot of time goes into reading and writing these files, which, for example, dramatically slows down batch processing of data. I was wondering whether the file format itself matters. Is there a preferred file format for faster reading/writing of Pandas DataFrames and/or Numpy arrays?

Asked By: c_david


Answers:

Use HDF5. Beats writing flat files hands down. And you can query. Docs are here

Here’s a perf comparison vs SQL. Updated to show SQL/HDF_fixed/HDF_table/CSV write and read perfs.

Docs now include a performance section:

See here
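
For reference, a minimal sketch of the HDF5 round trip (requires the PyTables package; the file and key names here are placeholders, and format='table' with data_columns is what makes the on-disk query via where= possible):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100000, 3), columns=list('ABC'))

# format='table' plus data_columns makes column A queryable on disk;
# format='fixed' (the default) is faster for a plain dump/load
df.to_hdf('data.h5', 'df', mode='w', format='table', data_columns=['A'])

df2 = pd.read_hdf('data.h5', 'df')                    # full read back
subset = pd.read_hdf('data.h5', 'df', where='A > 0')  # only matching rows are loaded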

Answered By: Jeff

It’s always a good idea to run some benchmarks for your use case. I’ve had good results storing raw structs via numpy:

df.to_records().astype(mytype).tofile('mydata')
df = pd.DataFrame.from_records(np.fromfile('mydata', dtype=mytype))

It is quite fast and takes up less space on the disk. But: you’ll need to keep track of the dtype to reload the data, it’s not portable between architectures, and it doesn’t support the advanced features of HDF5. (numpy has a more advanced binary format which is designed to overcome the first two limitations, but I haven’t had much success getting it to work.)
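
For context, here is a minimal self-contained sketch of that round trip, assuming a small frame with a short string index (the 'U10' width, column names, and file name are illustrative, not from the original answer):

import numpy as np
import pandas as pd

# Toy frame: float columns plus a string index, mirroring the setup described above
df = pd.DataFrame(np.random.rand(1000, 3), columns=list('abc'),
                  index=['row{}'.format(i) for i in range(1000)])

# The record dtype must be spelled out, including a fixed width for the string index
mytype = np.dtype([('index', 'U10'), ('a', 'f8'), ('b', 'f8'), ('c', 'f8')])

df.to_records().astype(mytype).tofile('mydata')
df2 = pd.DataFrame.from_records(np.fromfile('mydata', dtype=mytype), index='index')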

Update: Thanks for pressing me for numbers. My benchmark indicates that HDF5 does indeed win, at least in my case. It’s both faster and smaller on disk! Here’s what I see with a dataframe of about 280k rows, 7 float columns, and a string index:

In [15]: %timeit df.to_hdf('test_fixed.hdf', 'test', mode='w')
10 loops, best of 3: 172 ms per loop
In [17]: %timeit df.to_records().astype(mytype).tofile('raw_data')
1 loops, best of 3: 283 ms per loop
In [20]: %timeit pd.read_hdf('test_fixed.hdf', 'test')
10 loops, best of 3: 36.9 ms per loop
In [22]: %timeit pd.DataFrame.from_records(np.fromfile('raw_data', dtype=mytype))
10 loops, best of 3: 40.7 ms per loop
In [23]: ls -l raw_data test_fixed.hdf
-rw-r----- 1 altaurog altaurog 18167232 Apr  8 12:42 raw_data
-rw-r----- 1 altaurog altaurog 15537704 Apr  8 12:41 test_fixed.hdf
Answered By: Aryeh Leib Taurog

HDF is indeed a very good choice; you can also use npy/npz, with some caveats.

Here is a benchmark using a data frame of 25k rows and 1000 columns filled with random floats:

Saving to HDF took 0.49s
Saving to npy took 0.40s
Loading from HDF took 0.10s
Loading from npy took 0.061s

npy is about 20% faster to write and about 40% faster to read if you don’t compress data.

Code used to generate the output above:

#!/usr/bin/python3

import pandas as pd
import numpy as np
import time

start = time.time()
f = pd.DataFrame()
for i in range(1000):
  f['col_{}'.format(i)] = np.random.rand(25000)
print('Generating data took {}s'.format(time.time() - start))

start = time.time()
f.to_hdf('frame.hdf', 'main', format='fixed')
print('Saving to HDF took {}s'.format(time.time() - start))

start = time.time()
np.savez('frame.npz', f.index, f.values)
print('Saving to npy took {}s'.format(time.time() - start))

start = time.time()
pd.read_hdf('frame.hdf')
print('Loading from HDF took {}s'.format(time.time() - start))

start = time.time()
npz = np.load('frame.npz')
# np.load returns an NpzFile; pull the stored arrays out by key
index, values = npz['arr_0'], npz['arr_1']
pd.DataFrame(values, index=index)
print('Loading from npy took {}s'.format(time.time() - start))
Answered By: rahenri

Recently pandas added support for the Parquet format, using the pyarrow library as a backend (written by Wes McKinney himself, with his usual obsession for performance).

You only need to install the pyarrow library and use the read_parquet and to_parquet methods. Parquet is much faster to read and write for bigger datasets (above a few hundred megabytes or more), and it also keeps track of dtype metadata, so you won’t lose data type information when writing to and reading from disk. It can also store some data types more efficiently than HDF5 does (strings and timestamps, for example: HDF5 has no native data type for those, so it uses pickle to serialize them, which is slow for big datasets).

Parquet is also a columnar format, which makes it very easy to do two things:

  • Quickly filter out columns you’re not interested in. With CSV you have to read the whole file, and only afterwards can you throw away the columns you don’t want. With Parquet you can read only the columns you’re interested in (see the sketch after this list).

  • Run queries that filter rows and read only the data you care about.
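
A minimal sketch of the Parquet round trip, assuming pyarrow is installed (the file and column names are just placeholders; columns= is the read_parquet argument for pulling a subset of columns off disk):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100000, 3), columns=['a', 'b', 'c'])

# Requires pyarrow (or fastparquet) to be installed
df.to_parquet('frame.parquet')

# Read everything back, dtypes preserved
df2 = pd.read_parquet('frame.parquet')

# Columnar read: only the requested columns are pulled off disk
ab = pd.read_parquet('frame.parquet', columns=['a', 'b'])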

Another interesting recent development is the Feather file format, also developed by Wes McKinney. It’s essentially an uncompressed Arrow format written directly to disk, so it is potentially faster to write than Parquet. The disadvantage is files that are 2-3x larger.
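
And a minimal Feather sketch under the same assumptions (note that to_feather requires a default integer index; call reset_index() first otherwise):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100000, 3), columns=['a', 'b', 'c'])

df.to_feather('frame.feather')          # fast, uncompressed, larger on disk
df2 = pd.read_feather('frame.feather')  # round trip preserves dtypes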

If the priority is speed I would recommend:

  • feather – the fastest
  • parquet – a bit slower, but saves lots of disk space
Answered By: artoby