tf.data pipeline from large numpy arrays for a multiple input, multiple output Keras model and distributed training

Question:

This question concerns the optimal tf.data setup for a multiple-input, multiple-output Keras (TensorFlow) model given the corresponding NumPy arrays.

For example, suppose we have input arrays x1 and x2 and output arrays y1 and y2. A tf.data Dataset can be built as follows:

import tensorflow as tf

train_data = tf.data.Dataset.zip(
    (
        tf.data.Dataset.from_tensor_slices((x1, x2)),  # inputs
        tf.data.Dataset.from_tensor_slices((y1, y2)),  # outputs
    )
)

The above code works for small arrays and single-GPU training. Two constraints make this approach impossible or inadvisable with the full data and model:

  1. The NumPy arrays are large enough to exceed the 2 GB protobuf limit on graph constants.

  2. Use of tf.distribute.MirroredStrategy to distribute the training over multiple GPUs.

What is the best way to pipeline the data?

Asked By: Anirban Mukherjee


Answers:

The best approach seems to be to:

(1) Create a tf.data dataset from each array independently and use tf.data.Dataset.zip to combine them into a single dataset (as in the sketch after this list). Because each array is embedded separately, this should make it possible to stay under the 2 GB limit, as long as no individual array exceeds it.

(2) Use the Dataset.concatenate method to chain tf.data datasets built from separate pieces of the data (sketched further below).

(3) Use the distribution strategy to define and compile the Keras model, and model.fit to train it. The distribution strategy handles most of the heavy lifting of both training the model on the different GPUs and distributing the data as needed to facilitate multi-GPU training.
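
For example, here is a minimal sketch of steps (1) and (3). The small stand-in arrays and the toy two-input, two-output model are invented purely for illustration:

import numpy as np
import tensorflow as tf

# Stand-ins for the large arrays in the question.
x1 = np.random.rand(1024, 16).astype("float32")
x2 = np.random.rand(1024, 8).astype("float32")
y1 = np.random.rand(1024, 1).astype("float32")
y2 = np.random.rand(1024, 1).astype("float32")

# Step (1): one dataset per array, zipped into the ((inputs), (outputs))
# structure that Keras expects for a multi-input, multi-output model.
train_data = tf.data.Dataset.zip(
    (
        (
            tf.data.Dataset.from_tensor_slices(x1),
            tf.data.Dataset.from_tensor_slices(x2),
        ),
        (
            tf.data.Dataset.from_tensor_slices(y1),
            tf.data.Dataset.from_tensor_slices(y2),
        ),
    )
).batch(64)

# Step (3): define and compile the model inside the strategy scope;
# model.fit then handles replication and data distribution across GPUs.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    in1 = tf.keras.Input(shape=(16,))
    in2 = tf.keras.Input(shape=(8,))
    h = tf.keras.layers.concatenate([in1, in2])
    out1 = tf.keras.layers.Dense(1, name="y1")(h)
    out2 = tf.keras.layers.Dense(1, name="y2")(h)
    model = tf.keras.Model(inputs=[in1, in2], outputs=[out1, out2])
    model.compile(optimizer="adam", loss="mse")

model.fit(train_data, epochs=2)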

If each individual array is too large to form a tf.data dataset using from_tensor_slices (https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices), then this approach will not work. In that case, I found it better to split each large array into multiple smaller arrays and apply the above procedure to the pieces (see the sketch below). Many typical solutions (e.g. https://www.pythonfixing.com/2022/01/fixed-using-datasets-from-large-numpy.html) are not designed for distributed training.
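
A sketch of that splitting approach, combined with step (2). The chunk count and the helper function name are invented for illustration; each piece is embedded as its own constant, so no single constant has to hold the full array:

import numpy as np
import tensorflow as tf

def dataset_from_chunks(array, n_chunks):
    # Split one large array into pieces and chain the per-piece
    # datasets with Dataset.concatenate (step (2) above).
    chunks = np.array_split(array, n_chunks)
    ds = tf.data.Dataset.from_tensor_slices(chunks[0])
    for chunk in chunks[1:]:
        ds = ds.concatenate(tf.data.Dataset.from_tensor_slices(chunk))
    return ds

# Stand-in for an array near the 2 GB limit; the resulting datasets
# are then zipped exactly as before.
x1_ds = dataset_from_chunks(np.random.rand(10_000, 16).astype("float32"), n_chunks=4)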

Answered By: Anirban Mukherjee