Pandas to HDF5?
Question:
How do I convert a big table in Pandas/Numpy to h5 format with the same structure? I used the next code, but received .h5 version with messy data
data.to_hdf('data.h5',format = 'table', key='data')
I attached the image with my data
Or which data type can you recommend ?
I received the next structure
Answers:
Setting format='table'
writes the data as a PyTables Table. When you do this all the data will be in the ‘table’ dataset in group defined by key=
. However, data of common data types will be grouped together in 1 ‘values_block_#’ column/field (all ints, all floats, etc). To write them separately, you also need data_columns=True
. That defines the columns to be created as indexed data columns (set to True
to use all columns).
Example below demonstrate the differences from each option. It creates 3 different files using data from your example. If you still don’t like the format with data_columns=True
, you can use h5py or tables (PyTables) package to create the HDF5 schema and write the data as you like.
- file_1.h5 – uses default format (‘fixed’)
- file_2.h5 – uses ‘table’ format (only)
- file_3.h5 – uses ‘table’ format with
data_columns=True
Code below:
id = [f'subj8_series8_{i}' for i in range(5) ] +
[f'subj8_series8_12409{i}' for i in range(5) ]
Fp1 = [ 12, 157, 34, -98, 28,
-160, -30, 64, 134, 159 ]
Fp2 = [ 60, 181, 111, 25, 120,
192, 261, 322, 383, 407 ]
df = pd.DataFrame({'id': id, 'Fp1': Fp1, 'Fp2': Fp2})
df.to_hdf('file_1.h5', key='data')
df.to_hdf('file_2.h5', key='data', format='table')
df.to_hdf('file_3.h5', key='data', format='table', data_columns=True)
How do I convert a big table in Pandas/Numpy to h5 format with the same structure? I used the next code, but received .h5 version with messy data
data.to_hdf('data.h5',format = 'table', key='data')
I attached the image with my data
Or which data type can you recommend ?
I received the next structure
Setting format='table'
writes the data as a PyTables Table. When you do this all the data will be in the ‘table’ dataset in group defined by key=
. However, data of common data types will be grouped together in 1 ‘values_block_#’ column/field (all ints, all floats, etc). To write them separately, you also need data_columns=True
. That defines the columns to be created as indexed data columns (set to True
to use all columns).
Example below demonstrate the differences from each option. It creates 3 different files using data from your example. If you still don’t like the format with data_columns=True
, you can use h5py or tables (PyTables) package to create the HDF5 schema and write the data as you like.
- file_1.h5 – uses default format (‘fixed’)
- file_2.h5 – uses ‘table’ format (only)
- file_3.h5 – uses ‘table’ format with
data_columns=True
Code below:
id = [f'subj8_series8_{i}' for i in range(5) ] +
[f'subj8_series8_12409{i}' for i in range(5) ]
Fp1 = [ 12, 157, 34, -98, 28,
-160, -30, 64, 134, 159 ]
Fp2 = [ 60, 181, 111, 25, 120,
192, 261, 322, 383, 407 ]
df = pd.DataFrame({'id': id, 'Fp1': Fp1, 'Fp2': Fp2})
df.to_hdf('file_1.h5', key='data')
df.to_hdf('file_2.h5', key='data', format='table')
df.to_hdf('file_3.h5', key='data', format='table', data_columns=True)