What is the fastest way to serialize a DataFrame besides to_pickle?

Question:

I need to serialize DataFrames and send them over the wire. For security reasons, I cannot use pickle.

What would be the next fastest way to do this? I was intrigued by the msgpack support in v0.13, but unless I’m doing something wrong, its performance seems much worse than pickle’s.

In [107]: from pandas.io.packers import pack

In [108]: df = pd.DataFrame(np.random.rand(1000, 100))

In [109]: %timeit buf = pack(df)
100 loops, best of 3: 15.5 ms per loop

In [110]: import pickle

In [111]: %timeit buf = pickle.dumps(df)
1000 loops, best of 3: 241 µs per loop

The best I’ve found so far is just serializing the homogeneous numpy arrays (df.as_blocks() was handy) with array.tostring() and rebuilding the DataFrame from them. The performance is comparable to pickle.

However, with this approach I am forced to convert columns of dtype=object (i.e., anything containing at least one string) to be entirely string, since NumPy’s fromstring() cannot deserialize dtype=object. Pickle manages to preserve mixed types in object columns (it seems to embed some function in the pickle output).
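For a homogeneous numeric frame, the raw-bytes round trip described above can be sketched roughly as follows (using the modern tobytes()/frombuffer names rather than the older tostring()/fromstring(); the serialize/deserialize helpers are illustrative names, and the index is not preserved in this minimal version):

```python
import numpy as np
import pandas as pd

def serialize(df):
    # Only valid for a homogeneous numeric frame: dump the raw block
    # of values plus just enough metadata to rebuild it.
    values = np.ascontiguousarray(df.values)
    meta = (str(values.dtype), values.shape, list(df.columns))
    return meta, values.tobytes()

def deserialize(meta, buf):
    dtype, shape, columns = meta
    arr = np.frombuffer(buf, dtype=dtype).reshape(shape)
    return pd.DataFrame(arr, columns=columns)

df = pd.DataFrame(np.random.rand(1000, 100))
meta, buf = serialize(df)
restored = deserialize(meta, buf)
assert (restored.values == df.values).all()
```

This is fast precisely because it is a straight memory copy; it is also why it breaks down for dtype=object columns, where the array holds pointers rather than the data itself.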

Asked By: capitalistcuttle


Answers:

This is now pretty competitive with this PR: https://github.com/pydata/pandas/pull/5498 (going to merge for 0.13 shortly)

In [1]: from pandas.io.packers import pack

In [2]: import cPickle as pkl

In [3]: df = pd.DataFrame(np.random.rand(1000, 100))

Above example

In [6]: %timeit buf = pack(df)
1000 loops, best of 3: 492 µs per loop

In [7]: %timeit buf = pkl.dumps(df,pkl.HIGHEST_PROTOCOL)
1000 loops, best of 3: 681 µs per loop

Much bigger frame

In [8]: df = pd.DataFrame(np.random.rand(100000, 100))

In [9]:  %timeit buf = pack(df)
10 loops, best of 3: 192 ms per loop

In [10]: %timeit buf = pkl.dumps(df,pkl.HIGHEST_PROTOCOL)
10 loops, best of 3: 119 ms per loop

Another option is to use an in-memory hdf file

See here: http://pytables.github.io/cookbook/inmemory_hdf5_files.html; no support yet in pandas for passing the driver arg (it could be done pretty simply by monkey-patching, though).
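Since pandas doesn’t expose the driver argument, the cookbook recipe can be used with PyTables directly. A minimal sketch, assuming PyTables is installed: the H5FD_CORE driver with the backing store disabled keeps the whole file in memory, and get_file_image() returns its bytes, ready to send over the wire.

```python
import numpy as np
import tables

# Open a purely in-memory HDF5 "file": with H5FD_CORE and the backing
# store disabled, nothing is ever written to disk.
h5 = tables.open_file("in_memory.h5", mode="w",
                      driver="H5FD_CORE",
                      driver_core_backing_store=0)
h5.create_array(h5.root, "values", np.random.rand(1000, 100))
image = h5.get_file_image()  # the serialized HDF5 file as bytes
h5.close()
```

On the receiving side, the same driver can reopen the image via the driver_core_image argument, again without touching disk.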

Another possibility is a ctable, see https://github.com/FrancescAlted/carray. Not supported in pandas ATM though.

Answered By: Jeff

Another option is BinTableFile, a binary file format for tabular data written in Cython for performance. Project GitHub: https://github.com/eSAMTrade/bintablefile

Answered By: asu