Keras-tuner can't find callback

Question:

I am using keras-tuner in order to obtain the best set of hyperparameters for my model.
I can reproduce my problem with a random dataset:

import random

import numpy as np
import tensorflow as tf
import keras_tuner as kt
from tensorflow.keras.layers import (Input, Conv1D, Dropout, MaxPooling1D,
                                     AveragePooling1D, Flatten, Dense)
from tensorflow.keras.models import Model


def generate_data(n_windows, n_timesteps):
    # builds 10 (features, labels) pairs: features of shape (n_windows, n_timesteps),
    # labels of shape (n_windows, 2)
    feature_vector_list = []
    label_list = []
    for i in range(10):
        x = tf.random.normal((n_windows, n_timesteps))
        feature_vector = [x]
        choices = [np.array([1, 0]), np.array([0, 1]),
                   np.array([0, 0]), np.array([1, 1])]
        labels = np.array([random.choice(choices) for _ in range(n_windows)])
        feature_vector_list.append(feature_vector)
        label_list.append(labels)
    return feature_vector_list, label_list


def custom_generator(feat_vector_list, label_list):
    assert len(feat_vector_list) == len(label_list), \
        "Number of feature vectors inconsistent with the number of labels"
    counter = 0
    while True:
        feat_vec = feat_vector_list[counter]
        list_labels = label_list[counter]
        counter = (counter + 1) % len(feat_vector_list)
        yield feat_vec, list_labels

Here is the model:

def model_builder(hp):

    n_timesteps, n_features, n_outputs = 60, 1, 2

    hp_units = hp.Int("units", min_value=50, max_value=500, step=50)
    hp_filters = hp.Int("filters", 4, 32, step=4, default=8)
    hp_kernel_size = hp.Int("kernel_size", 3, 50, step=1)
    hp_pool_size = hp.Int("pool_size", 2, 8, step=1)
    hp_dropout = hp.Float("dropout", 0.1, 0.5, step=0.1)

    input1 = Input(shape=(n_timesteps, n_features))
    conv1 = Conv1D(filters=hp_filters,
                   kernel_size=hp_kernel_size,
                   activation='relu')(input1)
    drop1 = Dropout(hp_dropout)(conv1)
    if hp.Choice("pooling", ["max", "avg"]) == "max":
        pool1 = MaxPooling1D(pool_size=hp_pool_size)(drop1)
    else:
        pool1 = AveragePooling1D(pool_size=hp_pool_size)(drop1)
    flatten1 = Flatten()(pool1)
    # hidden layers
    dense1 = Dense(hp_units, activation='relu')(flatten1)
    outputs = Dense(n_outputs, activation='softmax')(dense1)
    model = Model(inputs=input1, outputs=outputs)
    model.compile(loss='categorical_crossentropy',
                  optimizer=tf.keras.optimizers.Adam(learning_rate=hp.Float("learning_rate",
                                                                            0.01,
                                                                            0.1,
                                                                            step=0.2)),
                  metrics=['accuracy'])
    return model

Here is the training script:

if __name__ == '__main__':
    N_WINDOWS = 350  # windows per feature vector, also used as steps_per_epoch below
    x_train, y_train = generate_data(N_WINDOWS, 60)
    x_val, y_val = generate_data(80, 60)
    training_generator = custom_generator(x_train, y_train)
    validation_generator = custom_generator(x_val, y_val)
    tuner = kt.Hyperband(
        model_builder,
        objective="val_accuracy",
        max_epochs=70,
        factor=3,
        directory="Results",
        project_name="cnn_tunning"
    )
    stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                  patience=5,
                                                  min_delta=0.002)

    tuner.search(
        training_generator,
        steps_per_epoch=N_WINDOWS,
        validation_data=validation_generator,
        validation_steps=75,
        callbacks=[stop_early],
    )

What I have found is that once Hyperband has run a decent number of iterations and the callback I set up should come into play, I get this error:

W tensorflow/core/framework/op_kernel.cc:1733] INVALID_ARGUMENT: ValueError: Could not find callback with key=pyfunc_530 in the registry.
Traceback (most recent call last):

  File "/home/diogomota/.cache/pypoetry/virtualenvs/WUAle-Z1-py3.7/lib/python3.7/site-packages/tensorflow/python/ops/script_ops.py", line 259, in __call__
    raise ValueError(f"Could not find callback with key={token} in the "

ValueError: Could not find callback with key=pyfunc_530 in the registry.


W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: INVALID_ARGUMENT: ValueError: Could not find callback with key=pyfunc_530 in the registry.
Traceback (most recent call last):

  File "/home/diogomota/.cache/pypoetry/virtualenvs/WUAle-Z1-py3.7/lib/python3.7/site-packages/tensorflow/python/ops/script_ops.py", line 259, in __call__
    raise ValueError(f"Could not find callback with key={token} in the "

ValueError: Could not find callback with key=pyfunc_530 in the registry.

However, the search just proceeds to the next trial, so I'm not sure what is going on. Can someone explain why it can't find the callback?

I'm using TensorFlow 2.8 and keras-tuner 1.1.2.

I could only find one place online with a similar issue, but no solution provided: https://issuemode.com/issues/tensorflow/tensorflow/72982126

EDIT:

  1. Provided full error message
  2. After further debugging, the problem comes solely from using a generator as input to .search(); I do not know why this is an issue. Regular training with .fit() works without any issues (see the sketch after this list)
  3. Added dataset generation code for reproducibility
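
For reference, an editor's sketch (assuming the model_builder, generators, N_WINDOWS and stop_early defined above) of the plain .fit() path that works, in contrast to tuner.search():

hp = kt.HyperParameters()
model = model_builder(hp)  # one model built with default hyperparameter values

model.fit(
    training_generator,
    steps_per_epoch=N_WINDOWS,
    validation_data=validation_generator,
    validation_steps=75,
    epochs=10,  # arbitrary number of epochs for the sanity check
    callbacks=[stop_early],
)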

Answers:

Looking at the source code of the error and reviewing the similar report you linked, this issue does not appear to be caused by the actual Keras callback (tf.keras.callbacks.EarlyStopping). The error occurs in the FuncRegistry class, a helper that maintains a map of unique tokens to registered Python functions, and in both cases the token (pyfunc_XXX) does not map to a function. Functions are inserted here when _internal_py_func is called, either while wrapping a Python function (to be executed as an eager TensorFlow operation) or while computing the gradient of an eager function. The global registry of tokens to functions (the FuncRegistry object) is supplied to initialize_py_trampoline, which is bound to the InitializePyTrampoline function in C++ through PyBind, so the token-to-function map is maintained in the C++ runtime as well.
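
To illustrate the mechanism (an editor's sketch, not from the question): wrapping a Python function as a TensorFlow op registers it in that global registry under a token such as "pyfunc_0"; the op itself only carries the token, so the lookup fails if the registry entry disappears before the op runs.

import numpy as np
import tensorflow as tf

def double(x):
    # plain Python/NumPy function to be wrapped as a TF op
    return x * 2

# Registers `double` in the FuncRegistry under a "pyfunc_N" token and runs it eagerly.
y = tf.numpy_function(double, [np.array([1.0, 2.0])], tf.float64)
print(y)  # tf.Tensor([2. 4.], shape=(2,), dtype=float64)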

At that level, tracing the error to the C++ source code from the logs, it is occurring in the destructor of the inner class Iterator, a field of GeneratorDatasetOp. The destructor is called when the object goes out of scope or is explicitly deleted – meaning it is called when the generator has finished its task, which sounds consistent with your observations of when the error occurred.

In summary, without being able to probe much further without a dataset, it sounds like there may be a problem with the custom generator. I would recommend trying to perform the training without keras-tuner but with the same generator implementation, to see whether the problem matches the other observation linked, as they were not using keras-tuner but were using a custom generator. If the error persists, it would also be worth checking whether previous releases (e.g. TensorFlow 2.7 or below) have the same problem with the generator. If it fails consistently, it may warrant submitting an actual issue to the TensorFlow GitHub repository, as it may be a core bug that requires further exploration.

Also, if you don't need a generator (that is, the data fits in memory), I would recommend supplying the dataset directly (calling fit/search with a list of NumPy arrays or a NumPy array instead of generator functions, as sketched below), as that path never touches the GeneratorDataset code that is currently failing, and it should not affect your hyperparameter search.
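
A rough sketch of that in-memory path, assuming the layout produced by generate_data in the question (each list entry holds a (n_windows, n_timesteps) tensor and an (n_windows, 2) label array):

x_train_list, y_train_list = generate_data(350, 60)
x_val_list, y_val_list = generate_data(80, 60)

# Stack the per-call windows into single arrays and add the feature axis
# expected by Input(shape=(n_timesteps, n_features)).
x_train = np.concatenate([fv[0].numpy() for fv in x_train_list])[..., np.newaxis]
y_train = np.concatenate(y_train_list)
x_val = np.concatenate([fv[0].numpy() for fv in x_val_list])[..., np.newaxis]
y_val = np.concatenate(y_val_list)

tuner.search(
    x_train, y_train,
    validation_data=(x_val, y_val),
    callbacks=[stop_early],
)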

Update

Thank you for the additional information and for including code to replicate your generator functions. I was able to reproduce the issue with Python 3.7 / TensorFlow 2.8 / keras-tuner 1.1.2 on CPU. If you inspect _funcs (the field of the global registry that maintains a dictionary of tokens to weak references to functions), it is actually empty. Upon further inspection, it looks like every time a new trial starts, _funcs is cleared and repopulated, which is consistent with keras-tuner creating a new graph (model) for every trial (although the same FuncRegistry is used throughout).
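
If you want to watch this yourself, here is a minimal debugging sketch using a hypothetical RegistryLogger callback. The names _py_funcs and _funcs are private TensorFlow internals (as they appear in the 2.8 source of tensorflow/python/ops/script_ops.py) and may change between releases:

import tensorflow as tf
from tensorflow.python.ops import script_ops

class RegistryLogger(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Tokens look like "pyfunc_530"; each maps to a registered Python function.
        tokens = list(script_ops._py_funcs._funcs.keys())
        print(f"epoch {epoch}: {len(tokens)} registered py funcs, last few: {tokens[-5:]}")

# e.g. tuner.search(..., callbacks=[stop_early, RegistryLogger()])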

The error does not occur if the EarlyStopping callback is omitted, so you were right that the error is linked to the callback. It also appears to be non-deterministic, as the trial and epoch at which it occurs vary per run.

With the cause of the error narrowed down, another person experienced the same issue, and they observed that the error is related to explicitly setting the min_delta parameter in the callback, as you are doing as well, which no other keras-tuner example does (e.g. in this example and this example from the documentation, only monitor and/or patience are set).

The impact of setting min_delta (which defaults to 0) in the EarlyStopping callback can be seen here. Specifically, _is_improvement can evaluate to True less frequently when min_delta is set to a non-zero value:

    if self._is_improvement(current, self.best):
      self.best = current
      self.best_epoch = epoch
      if self.restore_best_weights:
        self.best_weights = self.model.get_weights()
      # Only restart wait if we beat both the baseline and our previous best.
      if self.baseline is None or self._is_improvement(current, self.baseline):
        self.wait = 0

  def _is_improvement(self, monitor_value, reference_value):
    return self.monitor_op(monitor_value - self.min_delta, reference_value)

Note that in your case, self.monitor_op is np.less, since the metric you’re monitoring is val_loss:

      if (self.monitor.endswith('acc') or self.monitor.endswith('accuracy') or
          self.monitor.endswith('auc')):
        self.monitor_op = np.greater
      else:
        self.monitor_op = np.less

When self._is_improvement evaluates to True less frequently, the patience criterion (self.wait >= self.patience) will be met more often, since self.wait is reset less frequently (self.baseline is None by default):

    if self.wait >= self.patience and epoch > 0:
      self.stopped_epoch = epoch
      self.model.stop_training = True
      if self.restore_best_weights and self.best_weights is not None:
        if self.verbose > 0:
          io_utils.print_msg(
              'Restoring model weights from the end of the best epoch: '
              f'{self.best_epoch + 1}.')
        self.model.set_weights(self.best_weights)
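
To make the effect concrete, here is a small worked example with hypothetical numbers. Note that EarlyStopping negates min_delta internally when the monitored metric should decrease (monitor_op = np.less), so for val_loss the check effectively becomes "current < best - min_delta":

import numpy as np

min_delta = 0.002             # as passed to EarlyStopping
effective_delta = -min_delta  # sign flipped internally for np.less metrics

best = 0.500                  # previous best val_loss
current = 0.499               # a genuine but tiny improvement

improved = np.less(current - effective_delta, best)  # np.less(0.501, 0.500)
print(improved)  # False: wait is not reset, so patience runs out sooner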

With this narrowed down, it appears to be related to the model stopping training more frequently, and to references to operations in the graph no longer existing when keras-tuner runs a trial.

In simpler terms, it seems like a bug in keras-tuner that needs to be submitted, which I did here with all the details from this response. To proceed in the meantime, if the min_delta criterion isn't necessary, I would suggest removing that parameter from EarlyStopping and running the script again to see whether you still hit the issue.
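
If it helps, that interim workaround just amounts to dropping min_delta (leaving it at its default of 0) in the training script:

stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)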

Update 2

Thank you for the additional information. I was able to reproduce the successful run when the generator is not used, and it looks like the other case I referenced was also using a generator in conjunction with EarlyStopping with a min_delta supplied.

Upon some further inspection, the function that is not found in the registry is finalize_py_func: every token that causes the error maps to finalize_py_func before _funcs is cleared. finalize_py_func is the inner function wrapped by script_ops.numpy_function, which wraps a Python function to be used as a TensorFlow op. The function where finalize_py_func is defined and returned as a TensorFlow op, finalize_fn, is supplied when constructing a generator, as can be seen here. The documentation of the generator's finalize function here says: "A TensorFlow function that will be called on the result of init_func immediately before a C++ iterator over this dataset is destroyed."

Overall, the error is related to the generator, not to the min_delta parameter. While setting min_delta makes the error occur sooner, it can happen even when min_delta is omitted if patience is lowered enough to force the early-stopping callback to trigger more often. Using your example, if you set patience to 1 and remove min_delta, the error appears fairly quickly.
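
For completeness, a configuration along those lines (no min_delta, patience lowered to 1) also reproduces the error when used with the generator:

stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=1)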

I have revised the GitHub issue to include that detail. It looks like the error still exists in TensorFlow 2.7, but if you downgrade to TensorFlow 2.6 (and Keras 2.6), it does not occur. If downgrading is possible, that may be the best option for proceeding until the issue is addressed.

Answered By: danielcahall