How to save a list in a pandas dataframe cell to an HDF5 table format?

Question:

I have a dataframe that I want to save in the appendable format to an HDF5 file. The dataframe looks like this:

    column1
0   [0, 1, 2, 3, 4]

And the code that replicates the issue is:

import pandas as pd
test = pd.DataFrame({"column1":[list(range(0,5))]})
test.to_hdf('test','testgroup',format="table")

Unfortunately, it returns this error:

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-65-c2dbeaca15df> in <module>
      1 test = pd.DataFrame({"column1":[list(range(0,5))]})
----> 2 test.to_hdf('test','testgroup',format="table")

7 frames

/usr/local/lib/python3.7/dist-packages/pandas/io/pytables.py in _maybe_convert_for_string_atom(name, block, existing_col, min_itemsize, nan_rep, encoding, errors, columns)
   4979                 error_column_label = columns[i] if len(columns) > i else f"No.{i}"
   4980                 raise TypeError(
-> 4981                     f"Cannot serialize the column [{error_column_label}]\n"
   4982                     f"because its data contents are not [string] but "
   4983                     f"[{inferred_type}] object dtype"

TypeError: Cannot serialize the column [column1]
because its data contents are not [string] but [mixed] object dtype

I am aware that I can save each value in a separate column. This does not help my extended use case, as there might be variable length lists.

I know I could convert the list to a string and then recreate it based on the string, but if I start converting each column to string, I might as well use a text format, like csv, instead of a binary one like hdf5.

Is there a standard way of saving lists into hdf5 table format?

Asked By: Andrei

||

Answers:

Python lists present a challenge when writing to HDF5 because they may contain different types. For example, this is a perfectly valid list: [1, 'two', 3.0]. Also, if I understand your dataframe, 'column1' may contain lists of different lengths. There is no (simple) way to represent this as an HDF5 dataset.
That's why you got the [mixed] object dtype message. The conversion of the dataframe creates an intermediate object that is written as a dataset. The dtype of the converted list data is "O" (object), and HDF5 doesn't support this type.

However, all is not lost. If we can make some assumptions about your data, we can wrangle it into an HDF5 dataset. Assumptions: 1) all list elements in the dataframe are the same type (int in this case), and 2) all lists are the same length. (We can handle different length lists, but it is more complicated.) Also, you will need to use a different package to write the HDF5 data (either PyTables or h5py). PyTables is the underlying package for Pandas HDF5 support and h5py is widely used. The choice is yours.
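
Before choosing an approach, it may help to check whether your data actually meets those two assumptions. The snippet below is only a sketch: describe_lists is a hypothetical helper (not part of pandas, PyTables, or h5py), and the sample lists are made-up stand-ins for what test["column1"].tolist() would return.

```python
def describe_lists(rows):
    """Return (same_type, same_length) for a list of lists."""
    # Collect every element type and every list length that occurs.
    types = {type(item) for row in rows for item in row}
    lengths = {len(row) for row in rows}
    return len(types) <= 1, len(lengths) <= 1


# Illustrative data only; in practice use test["column1"].tolist().
uniform = [[0, 1, 2, 3, 4], [10, 11, 12, 13, 14]]
ragged = [[0, 1, 2], [10, 11]]
mixed = [[1, "two", 3.0]]

print(describe_lists(uniform))  # (True, True): fits the approach in this answer
print(describe_lists(ragged))   # (True, False): lists differ in length
print(describe_lists(mixed))    # (False, True): element types differ
```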

Before I post the code, here is an outline of the process:

  1. Create a NumPy record array (aka recarray) from the dataframe.
  2. Define the desired type and shape for the HDF5 dataset (as an Atom for
    PyTables, or a dtype for h5py).
  3. Create the dataset with the Atom/dtype definition above (could do on 1 line, but
    easier to read this way).
  4. Loop over rows of the recarray (from Step 1), and write data to rows of
    the dataset. This converts the list to the equivalent array.

Code to create recarray (adds 2 rows to your dataframe):

import pandas as pd
test = pd.DataFrame({"column1":[list(range(0,5)), list(range(10,15)), list(range(100,105))]})
# create recarray from the dataframe (index=False leaves the dataframe index out of the records)
rec_arr = test.to_records(index=False)

PyTables specific code to export data:

import tables as tb
with tb.File('74489101_tb.h5', 'w') as h5f:
    # define "atom" with type and shape of column1 data
    df_atom = tb.Atom.from_type('int32', shape=(len(rec_arr[0]['column1']),) )
    # create the dataset
    test = h5f.create_array('/','test', shape=rec_arr.shape, atom=df_atom )
    # loop over recarray and populate dataset
    for i in range(rec_arr.shape[0]):
        test[i] = rec_arr[i]['column1']
    print(test[:])  

h5py specific code to export data:

import h5py
with h5py.File('74489101_h5py.h5', 'w') as h5f:
    # define dtype with type and shape of column1 data
    df_dt = (int,(len(rec_arr[0]['column1']),))
    # create the dataset
    test = h5f.create_dataset('test', shape=rec_arr.shape, dtype=df_dt )
    # loop over recarray and populate dataset
    for i in range(rec_arr.shape[0]):
        test[i] = rec_arr[i]['column1']
    print(test[:])
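
For lists of different lengths, one possible route is h5py's variable-length dtype. This is a sketch rather than a tested recipe for your data: it assumes every list holds integers, and the file name, dataset name, and sample values are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Made-up ragged data standing in for test["column1"].tolist().
ragged = [[0, 1, 2, 3, 4], [10, 11], [100, 101, 102]]

# Hypothetical file location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "vlen_demo.h5")

with h5py.File(path, "w") as h5f:
    # Each dataset element is a 1-D int32 array of arbitrary length.
    vlen_int = h5py.vlen_dtype(np.dtype("int32"))
    ds = h5f.create_dataset("test_vlen", shape=(len(ragged),), dtype=vlen_int)
    for i, row in enumerate(ragged):
        ds[i] = np.asarray(row, dtype="int32")

# Read the rows back and convert each array to a plain Python list.
with h5py.File(path, "r") as h5f:
    restored = [row.tolist() for row in h5f["test_vlen"][:]]

print(restored)
```

Note that this writes a plain h5py dataset, so it is not the appendable table format that to_hdf(format="table") produces.
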
Answered By: kcw78