Improve pandas (PyTables?) HDF5 table write performance

Question:

I’ve been using pandas for research for about two months now, to great effect. With large numbers of medium-sized trace event datasets, pandas + PyTables (the HDF5 interface) does a tremendous job of allowing me to process heterogeneous data using all the Python tools I know and love.

Generally speaking, I use the Fixed (formerly “Storer”) format in PyTables, as my workflow is write-once, read-many, and many of my datasets are sized such that I can load 50-100 of them into memory at a time with no serious disadvantages. (NB: I do much of my work on Opteron server-class machines with 128GB+ system memory.)

However, for large datasets (500MB and greater), I would like to be able to use the more scalable random-access and query abilities of the PyTables “Tables” format, so that I can perform my queries out-of-memory and then load the much smaller result set into memory for processing. The big hurdle here, however, is the write performance. Yes, as I said, my workflow is write-once, read-many, but the relative times are still unacceptable.
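
(To make the target workflow concrete, here is a rough sketch of the kind of out-of-core query I would like to run against a Table-format store; the file name, column, and threshold are purely illustrative, and querying on a column like this assumes it was stored as a data column:)

import pandas as pd

# Open the Table-format store read-only and let PyTables evaluate the
# condition on disk, so that only the matching rows are loaded into memory.
store = pd.HDFStore('events_table.h5', mode='r')
long_events = store.select('events', where='duration > 1000000')
store.close()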

As an example, I recently ran a large Cholesky factorization that took 3 minutes, 8 seconds (188 seconds) on my 48-core machine. This generated a trace file of ~2.2 GB – the trace is generated in parallel with the program, so there is no additional “trace creation time.”

The initial conversion of my binary trace file into the pandas/PyTables format takes a decent chunk of time, but largely because the binary format is deliberately written out of order to reduce the performance impact of the trace generator itself. This is also irrelevant to the performance loss in moving from the Storer format to the Table format.

My tests were initially run with pandas 0.12, numpy 1.7.1, PyTables 2.4.0, and numexpr 0.20.1. My 48-core machine runs at 2.8 GHz per core, and I am writing to an ext3 filesystem which is probably (but not certainly) on an SSD.

I can write the entire dataset to a Storer-format HDF5 file (resulting file size: 3.3 GB) in 7.1 seconds. The same dataset, written to the Table format (resulting file size also 3.3 GB), takes 178.7 seconds to write.

The code is as follows:

import pandas as pd

with Timer() as t:   # Timer is a simple wall-clock timing context manager (sketched below)
    store = pd.HDFStore('test_storer.h5', 'w')
    store.put('events', events_dataset, table=False, append=False)
store.close()
print('Fixed format write took ' + str(t.interval))

with Timer() as t:
    store = pd.HDFStore('test_table.h5', 'w')
    store.put('events', events_dataset, table=True, append=False)
store.close()
print('Table format write took ' + str(t.interval))

and the output is simply

Fixed format write took 7.1
Table format write took 178.7
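
(For reference, Timer is just a simple wall-clock timing context manager of my own, roughly along these lines:)

import time

class Timer(object):
    # Minimal timing helper: records elapsed wall-clock seconds in .interval
    def __enter__(self):
        self.start = time.time()
        return self
    def __exit__(self, *exc):
        self.interval = time.time() - self.start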

My dataset has 28,880,943 rows, and the columns are basic datatypes:

node_id           int64
thread_id         int64
handle_id         int64
type              int64
begin             int64
end               int64
duration          int64
flags             int64
unique_id         int64
id                int64
DSTL_LS_FULL    float64
L2_DMISS        float64
L3_MISS         float64
kernel_type     float64
dtype: object

…so I don’t think there should be any data-specific issues with the write speed.

I’ve also tried adding BLOSC compression, to rule out any strange I/O issues that might affect one scenario or the other, but compression seems to decrease the performance of both equally.
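
(For reference, I enabled the compression at the store level, roughly like this; the compression level shown is just illustrative:)

import pandas as pd

# Store-level compression settings are inherited by every dataset written
# to this store, so both formats get the same treatment.
store = pd.HDFStore('test_compressed.h5', mode='w', complevel=9, complib='blosc')
store.put('events', events_dataset, table=True, append=False)
store.close()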

Now, I realize that the pandas documentation says that the Storer format offers significantly faster writes, and slightly faster reads. (I do experience the faster reads, as a read of the Storer format seems to take around 2.5 seconds, while a read of the Table format takes around 10 seconds.) But it really seems excessive that the Table format write should take 25 times as long as the Storer format write.

Can any of the folks involved with PyTables or pandas explain the architectural (or otherwise) reasons why writing to the queryable format (which clearly requires very little extra data) should take an order of magnitude longer? And is there any hope for improving this in the future? I’d love to jump in and contribute to one project or the other, as my field is high-performance computing and I see a significant use case for both projects in this domain. But it would be helpful to get some clarification on the issues involved first, and/or some advice on how to speed things up from those who know how the system is built.

EDIT:

Running the former tests with %prun in IPython gives the following (somewhat reduced for readability) profile output for the Storer/Fixed format:

%prun -l 20 profile.events.to_hdf('test.h5', 'events', table=False, append=False)

3223 function calls (3222 primitive calls) in 7.385 seconds

Ordered by: internal time
List reduced from 208 to 20 due to restriction <20>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    6    7.127    1.188    7.128    1.188 {method '_createArray' of 'tables.hdf5Extension.Array' objects}
    1    0.242    0.242    0.242    0.242 {method '_closeFile' of 'tables.hdf5Extension.File' objects}
    1    0.003    0.003    0.003    0.003 {method '_g_new' of 'tables.hdf5Extension.File' objects}
   46    0.001    0.000    0.001    0.000 {method 'reduce' of 'numpy.ufunc' objects}

and the following for the Tables format:

   %prun -l 40 profile.events.to_hdf('test.h5', 'events', table=True, append=False, chunksize=1000000)

   499082 function calls (499040 primitive calls) in 188.981 seconds

   Ordered by: internal time
   List reduced from 526 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       29   92.018    3.173   92.018    3.173 {pandas.lib.create_hdf_rows_2d}
      640   20.987    0.033   20.987    0.033 {method '_append' of 'tables.hdf5Extension.Array' objects}
       29   19.256    0.664   19.256    0.664 {method '_append_records' of 'tables.tableExtension.Table' objects}
      406   19.182    0.047   19.182    0.047 {method '_g_writeSlice' of 'tables.hdf5Extension.Array' objects}
    14244   10.646    0.001   10.646    0.001 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
      472   10.359    0.022   10.359    0.022 {method 'copy' of 'numpy.ndarray' objects}
       80    3.409    0.043    3.409    0.043 {tables.indexesExtension.keysort}
        2    3.023    1.512    3.023    1.512 common.py:134(_isnull_ndarraylike)
       41    2.489    0.061    2.533    0.062 {method '_fillCol' of 'tables.tableExtension.Row' objects}
       87    2.401    0.028    2.401    0.028 {method 'astype' of 'numpy.ndarray' objects}
       30    1.880    0.063    1.880    0.063 {method '_g_flush' of 'tables.hdf5Extension.Leaf' objects}
      282    0.824    0.003    0.824    0.003 {method 'reduce' of 'numpy.ufunc' objects}
       41    0.537    0.013    0.668    0.016 index.py:607(final_idx32)
    14490    0.385    0.000    0.712    0.000 array.py:342(_interpret_indexing)
       39    0.279    0.007   19.635    0.503 index.py:1219(reorder_slice)
        2    0.256    0.128   10.063    5.031 index.py:1099(get_neworder)
        1    0.090    0.090  119.392  119.392 pytables.py:3016(write_data)
    57842    0.087    0.000    0.087    0.000 {numpy.core.multiarray.empty}
    28570    0.062    0.000    0.107    0.000 utils.py:42(is_idx)
    14164    0.062    0.000    7.181    0.001 array.py:711(_readSlice)

EDIT 2:

Running the tests again with a pre-release copy of pandas 0.13 (pulled Nov 20 2013 at about 11:00 EST), I find that write times for the Tables format improve significantly but still don’t compare “reasonably” to the write speeds of the Storer/Fixed format.

%prun -l 40 profile.events.to_hdf('test.h5', 'events', table=True, append=False, chunksize=1000000)

         499748 function calls (499720 primitive calls) in 117.187 seconds

   Ordered by: internal time
   List reduced from 539 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      640   22.010    0.034   22.010    0.034 {method '_append' of 'tables.hdf5Extension.Array' objects}
       29   20.782    0.717   20.782    0.717 {method '_append_records' of 'tables.tableExtension.Table' objects}
      406   19.248    0.047   19.248    0.047 {method '_g_writeSlice' of 'tables.hdf5Extension.Array' objects}
    14244   10.685    0.001   10.685    0.001 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
      472   10.439    0.022   10.439    0.022 {method 'copy' of 'numpy.ndarray' objects}
       30    7.356    0.245    7.356    0.245 {method '_g_flush' of 'tables.hdf5Extension.Leaf' objects}
       29    7.161    0.247   37.609    1.297 pytables.py:3498(write_data_chunk)
        2    3.888    1.944    3.888    1.944 common.py:197(_isnull_ndarraylike)
       80    3.581    0.045    3.581    0.045 {tables.indexesExtension.keysort}
       41    3.248    0.079    3.294    0.080 {method '_fillCol' of 'tables.tableExtension.Row' objects}
       34    2.744    0.081    2.744    0.081 {method 'ravel' of 'numpy.ndarray' objects}
      115    2.591    0.023    2.591    0.023 {method 'astype' of 'numpy.ndarray' objects}
      270    0.875    0.003    0.875    0.003 {method 'reduce' of 'numpy.ufunc' objects}
       41    0.560    0.014    0.732    0.018 index.py:607(final_idx32)
    14490    0.387    0.000    0.712    0.000 array.py:342(_interpret_indexing)
       39    0.303    0.008   19.617    0.503 index.py:1219(reorder_slice)
        2    0.288    0.144   10.299    5.149 index.py:1099(get_neworder)
    57871    0.087    0.000    0.087    0.000 {numpy.core.multiarray.empty}
        1    0.084    0.084   45.266   45.266 pytables.py:3424(write_data)
        1    0.080    0.080   55.542   55.542 pytables.py:3385(write)

I noticed while running these tests that there are long periods where writing seems to “pause” (the file on disk is not actively growing), and yet there is also low CPU usage during some of these periods.

I am beginning to suspect that some known ext3 limitations may interact badly with either pandas or PyTables. Ext3 and other non-extent-based filesystems sometimes struggle to unlink large files promptly, and I see similar behavior (low CPU usage but long wait times) even during a simple ‘rm’ of a 1 GB file, for instance.

To clarify, in each test case, I made sure to remove the existing file, if any, before starting the test, so as not to incur any ext3 file removal/overwrite penalty.

However, when re-running this test with index=None, performance improves drastically (~50 s vs. ~120 s when indexing). So it would seem that either this process remains CPU-bound (my system has relatively old AMD Opteron Istanbul CPUs running at 2.8 GHz, and although it has 8 sockets with 6-core CPUs in each, all but one of them, of course, sit idle during the write), or that there is some conflict between the way PyTables or pandas manipulates/reads/analyzes the file once it is already partially or fully on the filesystem, causing pathologically bad I/O behavior while the indexing is occurring.
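
(Concretely, the no-index run was along these lines, shown here with index=False, which disables index creation:)

# Same Table-format write as the profiled runs above, but skipping the
# creation of the PyTables column index entirely.
profile.events.to_hdf('test.h5', 'events', table=True, append=False,
                      chunksize=1000000, index=False)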

EDIT 3:

@Jeff’s suggested tests on a smaller dataset (1.3 GB on disk), after upgrading PyTables from 2.4 to 3.0.0, have gotten me here:

In [7]: %timeit f(df)
1 loops, best of 3: 3.7 s per loop

In [8]: %timeit f2(df) # where chunksize=2000000
1 loops, best of 3: 13.8 s per loop

In [9]: %timeit f3(df) # where chunksize=2000000
1 loops, best of 3: 43.4 s per loop

In fact, my performance seems to beat his in all scenarios except for when indexing is turned on (the default). However, indexing still seems to be a killer, and if the way I’m interpreting the output from top and ls as I run these tests is correct, there remain periods of time when there is neither significant processing nor any file writing happening (i.e., CPU usage for the Python process is near 0, and the filesize remains constant). I can only assume these are file reads. Why file reads would be causing slowdowns is hard for me to understand, as I can reliably load an entire 3+ GB file from this disk into memory in under 3 seconds. If they’re not file reads, then what is the system ‘waiting’ on? (No one else is logged into the machine, and there is no other filesystem activity.)

At this point, with upgraded versions of the relevant Python modules, the performance for my original dataset is down to the following figures. Of special interest are the system time, which I assume is at least an upper bound on the time spent performing I/O, and the wall time, which perhaps accounts for these mysterious periods of no write/no CPU activity.

In [28]: %time f(profile.events)
CPU times: user 0 ns, sys: 7.16 s, total: 7.16 s
Wall time: 7.51 s

In [29]: %time f2(profile.events)
CPU times: user 18.7 s, sys: 14 s, total: 32.7 s
Wall time: 47.2 s

In [31]: %time f3(profile.events)
CPU times: user 1min 18s, sys: 14.4 s, total: 1min 32s
Wall time: 2min 5s

Nevertheless, it would appear that indexing causes a significant slowdown for my use case. Perhaps I should try limiting which fields are indexed instead of simply accepting the default (which may very well index all of the fields in the DataFrame)? I am not sure how this is likely to affect query times, especially for queries that select on a non-indexed field.
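
(Something like the following is what I have in mind; the choice of data columns here is mine and untested, and only those columns, plus the DataFrame index, would then be indexed and directly usable in on-disk queries:)

import pandas as pd

# Declare only the fields I expect to query on as data columns; these
# (plus the index) are then the only columns PyTables indexes, and the
# only ones available to on-disk 'where' selections.
store = pd.HDFStore('test_datacols.h5', mode='w')
store.append('events', profile.events,
             data_columns=['node_id', 'thread_id', 'begin', 'end'])
store.close()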

Per Jeff’s request, a ptdump of the resulting file.

ptdump -av test.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/df (Group) ''
  /df._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := [],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
    levels := 1,
    nan_rep := 'nan',
    non_index_axes := 
    [(1, ['node_id', 'thread_id', 'handle_id', 'type', 'begin', 'end', 'duration', 'flags', 'unique_id', 'id', 'DSTL_LS_FULL', 'L2_DMISS', 'L3_MISS', 'kernel_type'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_frame',
    values_cols := ['values_block_0', 'values_block_1']]
/df/table (Table(28880943,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Int64Col(shape=(10,), dflt=0, pos=1),
  "values_block_1": Float64Col(shape=(4,), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (4369,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df/table._v_attrs (AttributeSet), 15 attributes:
   [CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0,
    FIELD_1_NAME := 'values_block_0',
    FIELD_2_FILL := 0.0,
    FIELD_2_NAME := 'values_block_1',
    NROWS := 28880943,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer',
    values_block_0_dtype := 'int64',
    values_block_0_kind := ['node_id', 'thread_id', 'handle_id', 'type', 'begin', 'end', 'duration', 'flags', 'unique_id', 'id'],
    values_block_1_dtype := 'float64',
    values_block_1_kind := ['DSTL_LS_FULL', 'L2_DMISS', 'L3_MISS', 'kernel_type']]

and another %prun with the updated modules and the full dataset:

%prun -l 25  %time f3(profile.events)
CPU times: user 1min 14s, sys: 16.2 s, total: 1min 30s
Wall time: 1min 48s

        542678 function calls (542650 primitive calls) in 108.678 seconds

   Ordered by: internal time
   List reduced from 629 to 25 due to restriction <25>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      640   23.633    0.037   23.633    0.037 {method '_append' of 'tables.hdf5extension.Array' objects}
       15   20.852    1.390   20.852    1.390 {method '_append_records' of 'tables.tableextension.Table' objects}
      406   19.584    0.048   19.584    0.048 {method '_g_write_slice' of 'tables.hdf5extension.Array' objects}
    14244   10.591    0.001   10.591    0.001 {method '_g_read_slice' of 'tables.hdf5extension.Array' objects}
      458    9.693    0.021    9.693    0.021 {method 'copy' of 'numpy.ndarray' objects}
       15    6.350    0.423   30.989    2.066 pytables.py:3498(write_data_chunk)
       80    3.496    0.044    3.496    0.044 {tables.indexesextension.keysort}
       41    3.335    0.081    3.376    0.082 {method '_fill_col' of 'tables.tableextension.Row' objects}
       20    2.551    0.128    2.551    0.128 {method 'ravel' of 'numpy.ndarray' objects}
      101    2.449    0.024    2.449    0.024 {method 'astype' of 'numpy.ndarray' objects}
       16    1.789    0.112    1.789    0.112 {method '_g_flush' of 'tables.hdf5extension.Leaf' objects}
        2    1.728    0.864    1.728    0.864 common.py:197(_isnull_ndarraylike)
       41    0.586    0.014    0.842    0.021 index.py:637(final_idx32)
    14490    0.292    0.000    0.616    0.000 array.py:368(_interpret_indexing)
        2    0.283    0.142   10.267    5.134 index.py:1158(get_neworder)
      274    0.251    0.001    0.251    0.001 {method 'reduce' of 'numpy.ufunc' objects}
       39    0.174    0.004   19.373    0.497 index.py:1280(reorder_slice)
    57857    0.085    0.000    0.085    0.000 {numpy.core.multiarray.empty}
        1    0.083    0.083   35.657   35.657 pytables.py:3424(write_data)
        1    0.065    0.065   45.338   45.338 pytables.py:3385(write)
    14164    0.065    0.000    7.831    0.001 array.py:615(__getitem__)
    28570    0.062    0.000    0.108    0.000 utils.py:47(is_idx)
       47    0.055    0.001    0.055    0.001 {numpy.core.multiarray.arange}
    28570    0.050    0.000    0.090    0.000 leaf.py:397(_process_range)
    87797    0.048    0.000    0.048    0.000 {isinstance}
Asked By: Peter Gaultney


Answers:

Here is a similar comparison I just did. It’s about 1/3 of the data, 10M rows. The final size is about 1.3 GB.

I define 3 timing functions:

Test the Fixed format (called Storer in 0.12). This writes in a PyTables Array format

import pandas as pd

def f(df):
    store = pd.HDFStore('test.h5','w')
    store['df'] = df
    store.close()

Write in the Table format, using the PyTables Table format, but do not create an index.

def f2(df):
    store = pd.HDFStore('test.h5','w')
    store.append('df',df,index=False)
    store.close()

Same as f2, but create an index (which is normally done)

def f3(df):
    store = pd.HDFStore('test.h5','w')
    store.append('df',df)
    store.close()

Create the frame

In [25]: df = concat([DataFrame(np.random.randn(10000000,10)),DataFrame(np.random.randint(0,10,size=50000000).reshape(10000000,5))],axis=1)

In [26]: df
Out[26]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 15 entries, 0 to 4
dtypes: float64(10), int64(5)


v0.12.0

In [27]: %timeit f(df)
1 loops, best of 3: 14.7 s per loop

In [28]: %timeit f2(df)
1 loops, best of 3: 32 s per loop

In [29]: %timeit f3(df)
1 loops, best of 3: 40.1 s per loop

master/v0.13.0

In [5]: %timeit f(df)
1 loops, best of 3: 12.9 s per loop

In [6]: %timeit f2(df)
1 loops, best of 3: 17.5 s per loop

In [7]: %timeit f3(df)
1 loops, best of 3: 24.3 s per loop

Timing runs with the same file as provided by the OP

In [4]: df = pd.read_hdf('test.h5','df')

In [5]: df
Out[5]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 28880943 entries, 0 to 28880942
Columns: 14 entries, node_id to kernel_type
dtypes: float64(4), int64(10)

Like f, Fixed format

In [6]: %timeit df.to_hdf('test.hdf','df',mode='w')
1 loops, best of 3: 36.2 s per loop

Like f2, Table format, no index

In [7]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',index=False)
1 loops, best of 3: 45 s per loop

In [8]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',index=False,chunksize=2000000)
1 loops, best of 3: 44.5 s per loop

Like f3, Table format with index

In [9]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',chunksize=2000000)
1 loops, best of 3: 1min 36s per loop

Like f3, Table format with index, compressed with blosc

In [10]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',chunksize=2000000,complib='blosc')
1 loops, best of 3: 46.5 s per loop

In [11]: %timeit pd.read_hdf('test.hdf','df')
1 loops, best of 3: 10.8 s per loop

Show the original file (test.h5) and the compressed one (test.hdf)

In [13]: !ls -ltr test.h*
-rw-r--r-- 1 jreback users 3471518282 Nov 20 18:20 test.h5
-rw-rw-r-- 1 jreback users  649327780 Nov 20 21:17 test.hdf

Several points to note.

  • Not creating an index can make a non-trivial difference in time. I also believe that if you have a string-based index it can substantially worsen write time. That said, you always want to create an index to make retrieval very fast.

    You didn’t include what your index is, nor whether it’s sorted (though I think this only makes a small difference).

  • The write penalty in my examples is roughly 2x (though I have seen it be somewhat bigger when INCLUDING the index time). Thus your 7 s (half my time) for 3x the number of rows I am writing is quite suspect. I am using a reasonably fast disk array; if you were using a flash-based disk, though, this is possible.

  • master/v0.13.0 (to be released very soon) improves the write times on tables substantially.

  • You can try setting the chunksize parameter to a bigger number when you write the data (its default is 100000). The purpose of the ‘relatively’ low default is to keep memory usage constant (if it is bigger, you will use more memory; in theory it should write faster, though).

  • Tables offer 2 advantages over the Fixed format: 1) query retrieval, and 2) appendability. Reading the entire table doesn’t take advantage of either, so if you ONLY want to read the entire table, then the Fixed format is recommended. (In my experience the flexibility of Tables greatly outweighs the write penalty, but YMMV.)

Bottom line is to repeat the timings (use IPython, as it will run multiple tests). If you can reproduce your results, then please post a %prun and I’ll take a look.

Update:

So the recommended way to store a table of this size is to compress with blosc and use pandas master/0.13.0 along with PyTables 3.0.0.

Answered By: Jeff

That’s an interesting discussion. I think Peter is getting awesome performance for the Fixed format because that format writes in a single shot, and also because he has a really good SSD (it can write at more than 450 MB/s).

Appending to a table is a more complex operation (the dataset has to be enlarged, and new records must be checked to ensure that they follow the schema of the table). This is why appending rows to tables is generally slower (but still, Jeff is getting ~70 MB/s, which is pretty good). That Jeff gets more speed than Peter is probably because he has a better processor.

Finally, indexing in PyTables uses a single processor, yes, and that normally is an expensive operation, so you should really disable it if you are not going to query data on-disk.
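
To make that concrete, here is a rough (untested) sketch of deferring the index from the pandas side: write with index=False, and only build the index afterwards, in a single pass, if on-disk queries turn out to be needed (the file, key, and DataFrame names are placeholders).

import pandas as pd

# Write the table without building any column indexes up front.
store = pd.HDFStore('events.h5', mode='w')
store.append('events', events_df, index=False)

# Later, only if on-disk querying is actually required, index the
# already-written table in one pass.
store.create_table_index('events', optlevel=9, kind='full')
store.close()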

Answered By: Francesc