KeyError: 'val_loss' when training model

Question

I am training a model with Keras and get an error from a callback in the fit_generator function. Training always runs up to the 3rd epoch and then fails with this error:

    annotation_path = 'train2.txt'
    log_dir = 'logs/000/'
    classes_path = 'model_data/deplao_classes.txt'
    anchors_path = 'model_data/yolo_anchors.txt'
    class_names = get_classes(classes_path)
    num_classes = len(class_names)
    anchors = get_anchors(anchors_path)

    input_shape = (416,416) # multiple of 32, hw

    is_tiny_version = len(anchors)==6 # default setting
    if is_tiny_version:
        model = create_tiny_model(input_shape, anchors, num_classes,
            freeze_body=2, weights_path='model_data/tiny_yolo_weights.h5')
    else:
        model = create_model(input_shape, anchors, num_classes,
            freeze_body=2, weights_path='model_data/yolo_weights.h5') # make sure you know what you freeze

    logging = TensorBoard(log_dir=log_dir)
    checkpoint = ModelCheckpoint(log_dir + 'ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5',
        monitor='val_loss', save_weights_only=True, save_best_only=True, period=3)

    reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, verbose=1)
    early_stopping = EarlyStopping(monitor='val_loss', min_delta=0, patience=10, verbose=1)


[error]
Traceback (most recent call last):
  File "train.py", line 194, in <module>
    _main()
  File "train.py", line 69, in _main
    callbacks=[logging, checkpoint])
  File "C:\Users\ilove\AppData\Roaming\Python\Python37\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\ilove\AppData\Roaming\Python\Python37\lib\site-packages\keras\engine\training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "C:\Users\ilove\AppData\Roaming\Python\Python37\lib\site-packages\keras\engine\training_generator.py", line 251, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "C:\Users\ilove\AppData\Roaming\Python\Python37\lib\site-packages\keras\callbacks.py", line 79, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "C:\Users\ilove\AppData\Roaming\Python\Python37\lib\site-packages\keras\callbacks.py", line 429, in on_epoch_end
    filepath = self.filepath.format(epoch=epoch + 1, **logs)
KeyError: 'val_loss'

Can anyone help me find the problem?

Thanks in advance for your help.

Asked By: Phuc Nguyen

Answer 1

This callback runs at the end of epoch 3, because period=3 makes it save every third epoch:

    checkpoint = ModelCheckpoint(log_dir + 'ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5',
        monitor='val_loss', save_weights_only=True, save_best_only=True, period=3)

The error message is claiming that there is no val_loss in the logs variable when executing:

    filepath = self.filepath.format(epoch=epoch + 1, **logs)

This would happen if fit is called without validation_data.

I would start by simplifying the path name for model checkpoint. It is probably enough to include the epoch in the name.
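The failure can be reproduced without Keras, because the checkpoint path is built with plain str.format. Below is a minimal sketch, assuming a logs dict that holds only the training loss; the loss value and variable names are made up for illustration.

```python
# Template copied from the ModelCheckpoint filepath in the question.
template = "ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5"

# Illustrative logs dict for an epoch that ran without validation data.
logs_without_validation = {"loss": 12.345}

try:
    template.format(epoch=3, **logs_without_validation)
    error_key = None
except KeyError as exc:
    # The missing placeholder name is carried in the exception arguments.
    error_key = exc.args[0]

# A template that only uses the epoch number formats fine with the same logs,
# since str.format ignores keyword arguments it does not need.
simple_template = "ep{epoch:03d}.h5"
simple_path = simple_template.format(epoch=3, **logs_without_validation)
```

With the simpler template the path no longer depends on val_loss. A checkpoint that monitors val_loss still needs validation data to be logged.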

Answered By: Pedro Marques

Answer 2

This answer doesn’t apply to the question, but this was at the top of the Google results for keras "KeyError: 'val_loss'" so I’m going to share the solution for my problem.

The error was the same for me: when using val_loss in the checkpoint file name, I would get the following error: KeyError: 'val_loss'. My checkpointer was also monitoring this field, so even if I took the field out of the file name, I would still get this warning from the checkpointer: WARNING:tensorflow:Can save best model only with val_loss available, skipping.

In my case, the issue was that I was upgrading from using Keras and TensorFlow 1 separately to using the Keras that came with TensorFlow 2. The period param for ModelCheckpoint had been replaced with save_freq. I erroneously assumed that save_freq behaved the same way, so I set save_freq=1, thinking this would save every epoch. However, the docs state:

save_freq: ‘epoch’ or integer. When using ‘epoch’, the callback saves the model after each epoch. When using integer, the callback saves the model at end of a batch at which this many samples have been seen since last saving. Note that if the saving isn’t aligned to epochs, the monitored metric may potentially be less reliable (it could reflect as little as 1 batch, since the metrics get reset every epoch). Defaults to ‘epoch’

Setting save_freq='epoch' solved the issue for me. Note: the OP was still using the old period parameter (period=3), so this is definitely not what was causing their problem.
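For reference, a checkpoint configured this way might look like the sketch below. It assumes the Keras bundled with TensorFlow 2. The file name pattern is illustrative, and the .weights.h5 suffix is used because newer Keras releases require it when save_weights_only=True.

```python
import tensorflow as tf

# Save once per epoch, so that val_loss (computed at epoch end) is available.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath="logs/ep{epoch:03d}.weights.h5",  # illustrative path, epoch only
    monitor="val_loss",
    save_best_only=True,
    save_weights_only=True,
    save_freq="epoch",  # replaces the old `period` argument
)
```

val_loss is still only logged when validation data is passed to fit, so this callback does not remove that requirement.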

Answered By: JoshuaCWebDeveloper

Answer 3

For me the problem was that I was trying to set the initial_epoch (in model.fit) to a value other than the standard 0. I was doing so because I’m running model.fit in a loop that runs 10 epochs each cycle, then retrieves history data, checks if loss has decreased and runs model.fit again until it’s satisfied.
I thought I had to update the value because I was resuming the previous model, but apparently not.

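# model, X, y, step, tolerance and max_epochs are assumed to be defined earlier.
switch = True
epoch = 0          # total epochs trained so far
wait = 0           # consecutive cycles without improvement
previous = 10E+10  # loss from the previous cycle
while switch:
    # initial_epoch is left at its default of 0 on every call
    history = model.fit( X, y, batch_size=1, epochs=step, verbose=False )
    epoch += step
    current = history.history["loss"][-1]
    if current >= previous:
        wait += 1
        if wait >= tolerance:
            switch = False
    else:
        wait = 0
    if epoch >= max_epochs:
        switch = False
    previous = current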

Answered By: Vasco Cansado Carvalho

Answer 4

In my case, the val_generator broke when the Colab notebook tried to read the images from Google Drive. I re-ran the cell that creates val_generator and it worked.

Answered By: Vo Trung

Answer 5

I do not know if this will work in all cases, but restarting my computer seemed to fix it for me.

Answered By: GILO

Answer 6

Use val_accuracy in the filepath and as the checkpoint's monitored metric. If that still doesn’t help, just restart the PC or Colab.

Answered By: nikhil sharma

Answer 7

This error happens when no validation data is provided to the model.
Check the parameters passed to model.fit_generator (or model.fit): train_data, steps_per_epoch, validation_data, validation_steps, epochs, initial_epoch, callbacks.
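A quick way to spot the mismatch before training starts is to compare the placeholders in the checkpoint file name against the metrics that will actually be logged. The sketch below uses only the standard library. The helper names are hypothetical, and the list of logged keys is something you would supply yourself.

```python
from string import Formatter

def fields_in_template(template):
    """Return the set of placeholder names used in a str.format template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}

def missing_log_keys(template, log_keys):
    """List placeholders that the given log keys cannot fill.

    'epoch' is excluded because the checkpoint callback supplies it itself.
    """
    return sorted(fields_in_template(template) - set(log_keys) - {"epoch"})

template = "ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5"

# Without validation data only the training loss is logged.
missing_without_val = missing_log_keys(template, ["loss"])

# With validation data the val_loss key is logged as well.
missing_with_val = missing_log_keys(template, ["loss", "val_loss"])
```

An empty result means every placeholder in the file name can be filled. Any name that shows up, such as val_loss, points at a metric that will not be logged with the current fit arguments.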

Answered By: Harsha Ys

Answer 8

I had this error and didn’t manage to find the cause of the bug anywhere online.

What was happening in my case was that I was asking for more training samples than I actually had. TF didn’t give me an explicit error for that and it even provided me with a saved value for the loss. I only received the esoteric KeyError: "val_loss" when trying to save that.

Hope this helps someone sniff out their silly bug if that’s what’s happening to them.
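One way to guard against this is to derive the step counts from the real sample counts instead of hard-coding them. This is a sketch that assumes the generator yields full, fixed-size batches. The function name and the numbers are hypothetical.

```python
def steps_for(num_samples, batch_size):
    """Number of full batches the data can supply in one pass."""
    steps = num_samples // batch_size
    if steps == 0:
        raise ValueError(
            f"{num_samples} samples cannot fill a batch of {batch_size}"
        )
    return steps

# Illustrative dataset sizes.
num_train, num_val, batch_size = 900, 100, 32

steps_per_epoch = steps_for(num_train, batch_size)   # 900 // 32
validation_steps = steps_for(num_val, batch_size)    # 100 // 32
```

Rounding down means the fit call never asks for more batches than exist, and the explicit error catches an empty or too-small split before training starts.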

Answered By: Sammy

Answer 9

Add the validation_data=(x_test, y_test) parameter to the call that fits the model.

Answered By: Karen
