How to merge multiple H5 to one H5 file with Python and h5py?

Question:

I am new to Python coding. I want to merge data from 2 H5 files into a main H5 file. My goal is to add all objects in the SRR6XX/SRR630/* groups of each source file (file names in the list h5_files) to the main (target) file (main_h5_path). The code below is my attempt to do this. When I run it, I get this exception:

Error occurred during H5 merging: 'Group' object has no attribute 'encode'

I also tried create_group(), but get the same exception.

What do I need to modify to get my code to work?

#read the mainfile dataset
        with h5py.File(main_h5_path, 'r') as h5_main_file_obj:
            # return if H5 doesn't contain any data
            if len(h5_main_file_obj.keys()) == 0:
                return
            main_file_timestamp_dtset_obj = h5_main_file_obj['/' + 'SRR6XX' + '/' + 'SRR630']

            for file in h5_files:
                with h5py.File(file, 'r') as h5_sub_file_obj:
                    # return if H5 doesn't contain any data
                    if len(h5_sub_file_obj.keys()) == 0:
                        continue
                    sub_file_timestamp_dtset_obj = h5_sub_file_obj['/' + 'SRR6XX' + '/' + 'SRR630']
                    # h5_main_file_obj.create_dataset(sub_file_timestamp_dtset_obj)
                    for ts_key in sub_file_timestamp_dtset_obj.keys():
                        print('ts_key', ts_key)
                        each_ts_ds = h5_sub_file_obj['/' + 'SRR6XX' + '/' + 'SRR630' + '/' + str(ts_key) + '/']
                        h5_main_file_obj.create_dataset(each_ts_ds)


    except (IOError, OSError, Exception) as e:
        print(f"Error occurred during H5 merging: {e}")
        return -1
    return 0
Asked By: Sayan Bera


Answers:

My original answer only copied the group names under group '/SRR6XX/SRR630' in the source files to the main (target) file. OP commented that they want to "copy the group names along with their datasets".
I updated my answer to reflect that request. It only requires a one-line change. (For reference, the line that only creates the groups is left in, commented out.)

Here are the changes to your original code required to get this working:

  1. The main (target) file must be opened in append mode ('a') to add new objects; it was opened read-only ('r').
  2. ts_key in your loop is the object name (not the object). Use .items() to get names and objects (or just reference the object by name).
  3. You are creating the new object at the root level of the main (target) file.
    Create it under the appropriate group object (main_file_timestamp_dtset_obj) instead.

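To illustrate point 2, here is a minimal sketch. The helper names make_demo_file and describe_children, and the object names ts_001 and loose_values, are made up for this example; only the SRR6XX/SRR630 path comes from the question. It shows that .keys() yields plain strings while .items() yields (name, object) pairs. Passing a Group object where create_dataset() expects a name string is most likely what produced the 'encode' message in the question, although the exact error text depends on the h5py version.

```python
import h5py


def make_demo_file(h5_path):
    """Write a tiny demo file (hypothetical contents) under SRR6XX/SRR630."""
    with h5py.File(h5_path, "w") as h5f:
        # Intermediate groups in the path are created automatically.
        parent = h5f.create_group("SRR6XX/SRR630")
        parent.create_group("ts_001").create_dataset("data", data=[1, 2, 3])
        parent.create_dataset("loose_values", data=[4, 5])


def describe_children(h5_path, group_path="SRR6XX/SRR630"):
    """Return the child names and a name -> 'group'/'dataset' mapping."""
    with h5py.File(h5_path, "r") as h5f:
        group = h5f[group_path]
        # .keys() gives names only (str), not h5py objects.
        names = list(group.keys())
        kinds = {}
        # .items() gives (name, object) pairs, so the type can be checked.
        for name, obj in group.items():
            kinds[name] = "group" if isinstance(obj, h5py.Group) else "dataset"
    return names, kinds
```

Calling describe_children() on a file written by make_demo_file() reports ts_001 as a group and loose_values as a dataset, which is the distinction the isinstance() check in the compact solution relies on.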
Modified code below:

import h5py

def your_function():

  with h5py.File(main_h5_path, 'a') as h5_main_file_obj: # need Append mode to add groups
    # return if H5 doesn't contain any data
    if len(h5_main_file_obj.keys()) == 0:
        return
    main_file_timestamp_dtset_obj = h5_main_file_obj['/SRR6XX/SRR630']

    for file in h5_files:
        with h5py.File(file, 'r') as h5_sub_file_obj:
            # skip this source file if it doesn't contain any data
            if len(h5_sub_file_obj.keys()) == 0:
                continue
            sub_file_timestamp_dtset_obj = h5_sub_file_obj['/SRR6XX/SRR630']
            # h5_main_file_obj.create_dataset(sub_file_timestamp_dtset_obj)
            for ts_key in sub_file_timestamp_dtset_obj.keys():
                print('ts_key:', ts_key)
                # This only creates group:
                #main_file_timestamp_dtset_obj.create_group(ts_key)
                # This copies Group and its objects (groups or datasets):
                grp_path = 'SRR6XX/SRR630/' + ts_key
                h5_sub_file_obj.copy(h5_sub_file_obj[grp_path], main_file_timestamp_dtset_obj)

I wrote another solution that is more compact and checks if source objects are Groups before copying. See below. Another check to consider: conflicts with existing group names in the main (target) file before copying each group. As noted in my comment, consider using External Links to avoid duplicate data.

def my_function():
      
    with h5py.File(main_h5_path, mode='a') as h5ft:
        if len(h5ft.keys()) == 0:
            return
        for h5_source in h5_files:
            with h5py.File(h5_source,'r') as h5fs:
                if len(h5fs.keys()) == 0:  # skip empty source files
                    continue
                for grp_name, h5_obj in h5fs['SRR6XX/SRR630'].items(): 
                    if isinstance(h5_obj,h5py.Group):
                        # This only creates group:
                        #h5ft['SRR6XX/SRR630'].create_group(grp_name) 
                        # This copies Group and its objects (groups or datasets):
                        grp_path = 'SRR6XX/SRR630/' + grp_name
                        h5fs.copy(h5fs[grp_path], h5ft['SRR6XX/SRR630'])
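The two suggestions mentioned above (checking for name conflicts and using External Links) could look roughly like the sketch below. It is not a drop-in replacement: the function names merge_groups and link_groups, the choice to skip conflicting names rather than overwrite them, and the use of require_group() to create the target group when it is missing are all assumptions made for illustration.

```python
import h5py

GROUP_PATH = "SRR6XX/SRR630"  # group path taken from the question


def merge_groups(main_h5_path, h5_files, group_path=GROUP_PATH):
    """Copy child groups into the main file; return the (file, name) pairs skipped."""
    skipped = []
    with h5py.File(main_h5_path, "a") as h5ft:
        # Assumption: create the target group if it does not exist yet.
        target = h5ft.require_group(group_path)
        for h5_source in h5_files:
            with h5py.File(h5_source, "r") as h5fs:
                if group_path not in h5fs:
                    continue
                for grp_name, h5_obj in h5fs[group_path].items():
                    if not isinstance(h5_obj, h5py.Group):
                        continue
                    # Assumption: on a name conflict, keep the existing group.
                    if grp_name in target:
                        skipped.append((h5_source, grp_name))
                        continue
                    h5fs.copy(h5_obj, target, name=grp_name)
    return skipped


def link_groups(main_h5_path, h5_files, group_path=GROUP_PATH):
    """Add external links to the source groups instead of copying their data."""
    with h5py.File(main_h5_path, "a") as h5ft:
        target = h5ft.require_group(group_path)
        for h5_source in h5_files:
            with h5py.File(h5_source, "r") as h5fs:
                if group_path not in h5fs:
                    continue
                names = [name for name, obj in h5fs[group_path].items()
                         if isinstance(obj, h5py.Group)]
            for grp_name in names:
                if grp_name not in target:
                    # The link stores the file name and the object path only.
                    target[grp_name] = h5py.ExternalLink(
                        h5_source, f"/{group_path}/{grp_name}")
```

With external links the main file stays small, but the source files must remain available at the stored paths for the linked groups to resolve. Copying duplicates the data and makes the main file self-contained.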
Answered By: kcw78