Can HDF5 data be read as a byte stream in real time in Python?

Question:

We have access to a multi-gigabyte HDF5 file as it’s being written over the course of many minutes. We would like to pull the most recent data written to the file as it becomes available (on a sub-second time frame).

Is there any way to read an HDF5 file as a stream of bytes as they are written?

I see this question (Read HDF5 in streaming in java) about Java, which seems to suggest that streaming might be possible with lower-level HDF5 tools, but isn’t available in that particular Java package.

Of particular note, the h5py Python package has a set of low-level APIs which I’m not familiar enough with to know whether they offer a solution.

https://api.h5py.org/

Asked By: David Parks


Answers:

The key to reading data streamed over a high-latency, high-bandwidth network connection is to reduce the number of calls to read(n) on the file, because these calls are sequential. HDF5 addresses this with chunked storage (a user-specified block size), which is set when a dataset is created and can be changed afterwards with the h5repack tool.

The block size is described in the SO question below. To summarize it here: data is stored in chunks of a user-specified shape. For example, a table with shape 1M x 128 could have a block size of 10k x 1, which would store the data in chunks of 10k rows by 1 column.

What is the block size in HDF5?
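
As a concrete sketch of that layout, here is how such a dataset could be created with h5py (the file and dataset names are hypothetical):

    import h5py
    import numpy as np

    # Hypothetical names: a 1M x 128 table stored in 10k x 1 chunks,
    # so each column can be read independently in 10k-row pieces.
    with h5py.File("data.h5", "w") as f:
        dset = f.create_dataset(
            "table",
            shape=(1_000_000, 128),
            dtype="f4",
            chunks=(10_000, 1),
        )
        # Write some sample data into the first chunk of column 0.
        dset[:10_000, 0] = np.arange(10_000, dtype="f4")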

When reading data through a Python file-like object (typical if you are accessing the file over a network), any access to the data results in roughly half a dozen small header reads, followed by one read(n) per block touched. Calls to read(n) are (unfortunately) sequential, so many small reads will be slow over the network. Setting the block size to something reasonable for your use case therefore reduces the number of read(n) calls.
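
For illustration, h5py (2.9 and later) can open any Python file-like object directly; a minimal sketch, with an in-memory buffer standing in for a network-backed stream and the names reused from the sketch above:

    import io
    import h5py

    # An in-memory buffer stands in for a network-backed file object;
    # every chunk touched below becomes one read(n) on the buffer.
    with open("data.h5", "rb") as raw:
        buf = io.BytesIO(raw.read())

    with h5py.File(buf, "r") as f:
        col = f["table"][:, 0]  # reads one 10k x 1 chunk at a time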

Note that there is often a trade-off here. Setting a block size of 10k x 128 forces all 128 columns to be read together; you can’t read just one column with that layout. But setting a block size of 10k x 1 means that a read of all 128 columns results in 128 read(n) calls for every 10k rows.
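
A rough sketch of that trade-off with the 10k x 1 layout from above (chunk counts are approximate, and the names are reused from the earlier sketch):

    import h5py

    with h5py.File("data.h5", "r") as f:
        # 10k x 1 chunks: one full column of 1M rows touches 100 chunks...
        one_column = f["table"][:, 0]
        # ...while 10k rows across all 128 columns touch 128 chunks.
        first_rows = f["table"][:10_000, :]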

If your data is not packed efficiently for your purpose, you can repack it (a slow, one-time process that doesn’t change the data, just the packing order) using h5repack.
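
For example, repacking the hypothetical table above from 10k x 1 chunks to 10k x 128 chunks would look something like this:

    h5repack -l table:CHUNK=10000x128 data.h5 repacked.h5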

Answered By: David Parks

I think what you are asking for is possible with HDF5 SWMR (Single-Writer/Multiple-Reader). The user guide describes how it works, and there is now support in h5py with examples.
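
As a minimal sketch of how this could look with h5py (the file and dataset names are hypothetical), the writer creates the file with the latest file format, enables SWMR mode, and flushes after each append:

    import h5py
    import numpy as np

    # Writer process: SWMR requires libver="latest".
    f = h5py.File("live.h5", "w", libver="latest")
    dset = f.create_dataset("data", shape=(0,), maxshape=(None,), dtype="f8")
    f.swmr_mode = True  # readers may open the file from this point on

    for _ in range(100):
        new = np.random.rand(10)
        dset.resize((dset.shape[0] + len(new),))
        dset[-len(new):] = new
        dset.flush()  # make the newly written rows visible to readers

A reader process can then poll the dataset and refresh its view to pick up whatever the writer has flushed so far:

    import h5py

    # Reader process: open in SWMR mode while the writer is still running.
    f = h5py.File("live.h5", "r", libver="latest", swmr=True)
    dset = f["data"]
    seen = 0
    while True:
        dset.refresh()  # re-read metadata to see the writer's new data
        if dset.shape[0] > seen:
            print(dset[seen:])
            seen = dset.shape[0]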

Answered By: James Mudd