Save data to hdf5 (e.g. from pandas df) with structure copied from another hdf5-file

Question:

I have astronomical data in columns that looks something like this:

M200c, M200m, dec, ra
19.4,  20.4,  1.33, 4.68
...

I need to save my data in hdf5 format so it can be fed into a script. I know the structure that the hdf5 file should have from an example file that is provided; inspecting it shows this structure:

import nexusformat.nexus as nx
f = nx.nxload('example_input_file.hdf5')
print(f.tree)

>>> root:NXroot
>>>  Data:NXgroup
>>>    M200c = float32(735697)
>>>    M200m = float32(735697)
>>>    dec = float32(735697)
>>>    ra = float32(735697)

Naively, I thought I could just load my data into a pandas df and then save it to hdf5 like this:

import pandas as pd

df ... # I do some data loading and processing here and eventually...
df.to_hdf('my_data_input_file.hdf5', key='df', mode='w') 

but pandas produces a very different and convoluted structure. Hence, when I feed my hdf5 input file to the script, it gives me the error `KeyError: 'Unable to open object (component not found)'`.

So is there a way/package with which I can copy the structure of my example hdf5 file and reproduce it when saving my data? Or can you provide me with a more hardcoded solution, maybe a loop through the names of all the columns that populates an empty hdf5 file? I am completely new to this format and don't know how it works. Thanks!

Asked By: NeStack


Answers:

Yes, as you discovered, pandas uses predefined schemas when writing HDF5 data and doesn't give you much control. I answered a similar question a few days ago. You can get close with the following pandas options: key='NXroot', format='table', data_columns=True. However, you won't be able to mimic the schema exactly. See this answer for some examples of that behavior: Pandas to HDF5?
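As a sketch of those options (the small dataframe and the filename here are just for illustration), this is what that pandas call looks like; the file round-trips fine through pandas, but the on-disk layout still won't match the simple NXroot/Data tree:

```python
import pandas as pd

# A tiny illustrative dataframe
df = pd.DataFrame({'M200c': [19.4, 18.2], 'M200m': [20.4, 15.7],
                   'dec': [1.33, 1.81], 'ra': [4.68, 4.81]})

# format='table' with data_columns=True writes each column as a queryable
# column, but pandas still adds its own metadata and nested structure
df.to_hdf('pandas_attempt.h5', key='NXroot', format='table', data_columns=True)

# Reading back with pandas works; the internal HDF5 layout does not
df2 = pd.read_hdf('pandas_attempt.h5', key='NXroot')
```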

Both the h5py and PyTables (imported as tables) packages can be used to create an HDF5 file exactly as you desire. It's relatively easy to do with either of them once you know how to access the dataframe columns and write individual datasets. Since PyTables is part of the pandas HDF5 stack, it might be simpler (for you) to implement. That said, h5py is also popular. I use both packages and like each for different reasons.

The process is similar with either package:

  1. Create the file.
  2. Create the 'Data' group.
  3. Loop over the dataframe columns, getting the names and data.
  4. For each column, create a dataset in the '/Data' group with the column name and write the column's data values to it.

Although written for a small example, both code blocks below should work as-is with your dataframe.

Code to create a simple dataframe to use in this example.

import pandas as pd
M200c = [ 19.4, 18.2, 11.5, 13.6, 27.1,
          11.7, 15.5, 23.3, 31.1, 22.2 ] 
M200m = [ 20.4, 15.7, 34.3, 18.0, 28.2,
          16.5, 30.0, 24.4, 17.7, 15.9 ]
dec = [ 1.33, 1.81, 1.11, 2.15, 1.20,
        1.92, 2.61, 3.22, 3.83, 4.07 ]
ra = [ 4.68, 4.81, 5.11, 5.25, 6.12,
       7.92, 5.61, 3.22, 3.83, 4.07 ]
df = pd.DataFrame({'M200c': M200c, 'M200m': M200m, 'dec': dec, 'ra':ra})

Code to create the file using PyTables (tables):

import tables as tb
# open_file is the documented way to create a new file with PyTables
with tb.open_file('file_tb.h5', 'w') as h5f:
    NXgrp = h5f.create_group('/', 'Data')
    for (colName, colData) in df.items():
        h5f.create_array(NXgrp, colName, obj=colData.values)

Code to create the file using h5py:

import h5py
with h5py.File('file_h5py.h5', 'w') as h5f:
    NXgrp = h5f.create_group('Data') 
    for (colName, colData) in df.items():
        NXgrp.create_dataset(colName, data=colData.values)
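One detail worth checking: the example file's tree shows float32 datasets, while dataframe columns built from Python floats are float64. A quick sketch (assuming h5py is available; the filename and the tiny dataframe are just for illustration) that casts on write and verifies the resulting layout:

```python
import h5py
import pandas as pd

# Illustrative stand-in for your real dataframe
df = pd.DataFrame({'M200c': [19.4, 18.2], 'M200m': [20.4, 15.7],
                   'dec': [1.33, 1.81], 'ra': [4.68, 4.81]})

with h5py.File('file_check.h5', 'w') as h5f:
    grp = h5f.create_group('Data')
    for colName, colData in df.items():
        # cast to float32 to match the example file's dtype
        grp.create_dataset(colName, data=colData.values.astype('float32'))

# Read the file back and confirm the group/dataset layout and dtype
with h5py.File('file_check.h5', 'r') as h5f:
    print(sorted(h5f['Data'].keys()))   # ['M200c', 'M200m', 'dec', 'ra']
    print(h5f['Data/M200c'].dtype)      # float32
```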
Answered By: kcw78