Storing a list of strings to a HDF5 Dataset from Python

Question:

I am trying to store a variable-length list of strings in an HDF5 dataset. The code for this is:

import h5py
h5File=h5py.File('xxx.h5','w')
strList=['asas','asas','asas']  
h5File.create_dataset('xxx',(len(strList),1),'S10',strList)
h5File.flush()
h5File.close()

I am getting the error “TypeError: No conversion path for dtype: dtype('<U3')”.
How can I solve this problem?

Asked By: gman

Answers:

Your list contains Unicode strings, but you are specifying the dataset's datatype as fixed-length ASCII ('S10'). According to the h5py wiki, h5py does not currently support this conversion.

You’ll need to encode the strings in a format h5py handles:

asciiList = [n.encode("ascii", "ignore") for n in strList]
h5File.create_dataset('xxx', (len(asciiList),1),'S10', asciiList)

Note: not every Unicode string can be encoded in ASCII! With "ignore", any character outside ASCII is silently dropped.
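
To make the caveat about encoding concrete, here is a minimal pure-Python sketch. The sample strings are made up for illustration; it only shows how str.encode behaves with different error handlers, not anything specific to h5py.

```python
# Hypothetical sample strings; the second one contains a non-ASCII character.
names = ["asas", "café"]

# errors="ignore" silently drops characters that ASCII cannot represent.
lossy = [s.encode("ascii", "ignore") for s in names]
print(lossy)  # [b'asas', b'caf']

# errors="strict" (the default) raises instead, which makes data loss visible.
try:
    strict = [s.encode("ascii") for s in names]
except UnicodeEncodeError as exc:
    strict = None
    print("cannot encode:", exc.object)

# UTF-8 can represent ordinary text, but the encoded bytes may be longer
# than the number of characters, which matters for fixed-width types like 'S10'.
utf8 = [s.encode("utf-8") for s in names]
max_len = max(len(b) for b in utf8)
```

If losing characters is not acceptable, the strict variant at least fails loudly, and the UTF-8 byte length is the number to compare against the fixed width of the dataset.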

Answered By: SlightlyCuban

From https://docs.h5py.org/en/stable/special.html:

In HDF5, data in VL format is stored as arbitrary-length vectors of a
base type. In particular, strings are stored C-style in
null-terminated buffers. NumPy has no native mechanism to support
this. Unfortunately, this is the de facto standard for representing
strings in the HDF5 C API, and in many HDF5 applications.

Thankfully, NumPy has a generic pointer type in the form of the
“object” (“O”) dtype. In h5py, variable-length strings are mapped to
object arrays. A small amount of metadata attached to an “O” dtype
tells h5py that its contents should be converted to VL strings when
stored in the file.

Existing VL strings can be read and written to with no additional
effort; Python strings and fixed-length NumPy strings can be
auto-converted to VL data and stored.

Example

In [27]: dt = h5py.special_dtype(vlen=str)

In [28]: dset = h5File.create_dataset('vlen_str', (100,), dtype=dt)

In [29]: dset[0] = 'the change of water into water vapour'

In [30]: dset[0]
Out[30]: 'the change of water into water vapour'
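
The session above can be turned into a small self-contained script along these lines. This is only a sketch: the file name, dataset name and sample list are placeholders, and the decode step is there because, depending on the h5py version, reading a variable-length string may return str or bytes.

```python
import os
import tempfile

import h5py

# Hypothetical sample data with strings of different lengths.
str_list = ["asas", "a much longer string", "x"]

# Variable-length string dtype, as in the session above.
dt = h5py.special_dtype(vlen=str)

# Throwaway file in a temporary directory (placeholder name).
path = os.path.join(tempfile.mkdtemp(), "vlen_demo.h5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset("vlen_str", (len(str_list),), dtype=dt)
    # Assign element by element, exactly like dset[0] = '...' above.
    for i, s in enumerate(str_list):
        dset[i] = s

with h5py.File(path, "r") as f:
    raw = f["vlen_str"][:]
    # Normalize to str whether the installed h5py hands back bytes or str.
    read_back = [v.decode("utf-8") if isinstance(v, bytes) else v for v in raw]

print(read_back)
```

Because the dtype is variable-length, no maximum string width has to be chosen up front, unlike with 'S10'.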

Answered By: yardstick17

I am in a similar situation: I want to store the column names of a DataFrame as a dataset in an HDF5 file. Assuming df.columns is what I want to store, I found that the following works:

import h5py
h5File = h5py.File('my_file.h5', 'w')
h5File['col_names'] = df.columns.values.astype('S')  # fixed-length byte strings
h5File.close()

This assumes the column names are ‘simple’ strings that can be encoded in ASCII.
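
As a rough illustration of what astype('S') does, the NumPy-only sketch below uses a hand-made object array as a stand-in for df.columns.values (the column names are invented). The UTF-8 variant at the end is an alternative I am suggesting for names that are not plain ASCII, not part of the approach above.

```python
import numpy as np

# Stand-in for df.columns.values: an object array of Python strings.
col_names = np.array(["time", "value", "id"], dtype=object)

# astype("S") produces fixed-length byte strings; the width is taken
# from the longest name (5 bytes here, for "value").
as_bytes = col_names.astype("S")
print(as_bytes.dtype.itemsize)

# Reading the names back means decoding each byte string again.
decoded = [b.decode("ascii") for b in as_bytes]

# For names outside ASCII, encoding explicitly as UTF-8 is one option.
utf8_bytes = np.char.encode(np.array(["café", "id"]), "utf-8")
```

Whichever encoding is used when writing has to be used again when decoding the names after reading the file.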

Answered By: Rajendra Koppula