Validation loss is lower than training loss training LSTM

Question:

I am training a LSTM using tf.learn in tensorflow. I have split the data into training (90%) and validation (10%) for this purpose. As I understand, a model usually fits better training data than validation data but I am getting the opposite results. Loss is lower and accuracy is higher for validation set.

As I have read in other answers, this can be because of dropout not being applied during validation. However, when I remove dropout from my LSTM architecture I validation loss is still lower than training loss (difference is smaller though).

Also, the loss shown at the end of each epoch is not an average of the losses over each batch (like when using Keras). It is the loss for he last batch. I also thought this could be a reason for my results but turned out it was not.

Training samples: 783
Validation samples: 87
--
Training Step: 4  | total loss: 1.08214 | time: 1.327s
| Adam | epoch: 001 | loss: 1.08214 - acc: 0.7549 | val_loss: 0.53043 - val_acc: 0.9885 -- iter: 783/783
--
Training Step: 8  | total loss: 0.41462 | time: 1.117s
| Adam | epoch: 002 | loss: 0.41462 - acc: 0.9759 | val_loss: 0.17027 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 12  | total loss: 0.15111 | time: 1.124s
| Adam | epoch: 003 | loss: 0.15111 - acc: 0.9984 | val_loss: 0.07488 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 16  | total loss: 0.10145 | time: 1.114s
| Adam | epoch: 004 | loss: 0.10145 - acc: 0.9950 | val_loss: 0.04173 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 20  | total loss: 0.26568 | time: 1.124s
| Adam | epoch: 005 | loss: 0.26568 - acc: 0.9615 | val_loss: 0.03077 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 24  | total loss: 0.11023 | time: 1.129s
| Adam | epoch: 006 | loss: 0.11023 - acc: 0.9863 | val_loss: 0.02607 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 28  | total loss: 0.07059 | time: 1.141s
| Adam | epoch: 007 | loss: 0.07059 - acc: 0.9934 | val_loss: 0.01882 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 32  | total loss: 0.03571 | time: 1.122s
| Adam | epoch: 008 | loss: 0.03571 - acc: 0.9977 | val_loss: 0.01524 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 36  | total loss: 0.05084 | time: 1.120s
| Adam | epoch: 009 | loss: 0.05084 - acc: 0.9948 | val_loss: 0.01384 - val_acc: 1.0000 -- iter: 783/783
--
Training Step: 40  | total loss: 0.22283 | time: 1.132s
| Adam | epoch: 010 | loss: 0.22283 - acc: 0.9714 | val_loss: 0.01227 - val_acc: 1.0000 -- iter: 783/783

The network used (note that dropout has been commented out):

def get_network_wide(frames, input_size, num_classes):
    """Create a one-layer LSTM"""
    net = tflearn.input_data(shape=[None, frames, input_size])
    #net = tflearn.lstm(net, 256, dropout=0.2)
    net = tflearn.fully_connected(net, num_classes, activation='softmax')
    net = tflearn.regression(net, optimizer='adam',
                             loss='categorical_crossentropy',metric='default', name='output1')
    return net 

Plot of validation loss vs training loss

Asked By: s0x

||

Answers:

This is not necessarily a problematic phenomenon in essence.

It can take place due to many reasons, as stated below.

  1. It may happen usually when your training data is harder to train on/learn patters on it, while the validation set boasts ‘easy’ images/data to classify on. The same is available for LSTM/sequence classification data.
  2. It may happen in the early phase of the training that the validation loss is smaller than the training loss + validation accuracy is bigger than the training accuracy.
  3. During validation, dropout is not enabled, leading to greater results on the validation set.
  4. The loss on training is calculated epoch-wise. At the end of an epoch, it’s the mean of batch losses (accumulated) throughout the epoch. The network learned patterns/relationships between&within the data, and when testing it against the validation, it uses the information already learned and it follows that the results can be better on validation E.g. TLosses [0.60,0.59,...0.3 (loss on TS at the end of the epoch)] -> VLosses [0.3,0.29,0.35] (because the model has already trained a lot as compared to the start of the epoch.

However, your training set is very small, as well as your validation set. Such a split (90% on train and 10% on validation/development) should be made only when there is very much data (tens of thousands or even hundreds of thousands in this case). On the other hand, your entire training set(train + val) has less than 1000 samples. You need much more data, as LSTMs are well-known for requiring a lot of training data.

Then, you could try using KFoldCrossValidation or even StratifiedKFoldCrossValidation. In this way, you would ensure that you have not manually created a very ‘easy’ validation set, on which you always test; instead, you can have k-folds, out of which k-1 are used for training and 1 for validation; in this way you may avoid situation (1).

The answer lies in the data. Prepare it carefully, as the results depend significantly on the quality of data (preprocessing data, cleaning data, creating relevant training/validation/test sets).

Answered By: Timbus Calin