Training gets progressively slower over time


This is the first time I've run into this issue. I've been using this model for a while, but with less data. In the first 2 epochs training took 11 sec/step (31k samples / 128 batch size), in the 3rd epoch it took 18 sec/step, and in the 4th it is running at about 45 sec/step. I'm using Keras and not doing any custom training loop shenanigans.

Can someone explain this slowdown? The training run hasn't been interrupted. I'm using TF 2.3.

Epoch 1/1200
248/248 [==============================] - 2727s 11s/step - loss: 2.3481 - acc: 0.3818 - top3_acc: 0.5751 - recall: 0.2228 - precision: 0.6195 - f1: 0.3239 - val_loss: 0.9020 - val_acc: 0.8085 - val_top3_acc: 0.8956 - val_recall: 0.5677 - val_precision: 0.9793 - val_f1: 0.7179
Epoch 2/1200
248/248 [==============================] - 2712s 11s/step - loss: 1.0319 - acc: 0.7203 - top3_acc: 0.8615 - recall: 0.5489 - precision: 0.9245 - f1: 0.6865 - val_loss: 0.5547 - val_acc: 0.8708 - val_top3_acc: 0.9371 - val_recall: 0.7491 - val_precision: 0.9661 - val_f1: 0.8435
Epoch 3/1200
248/248 [==============================] - 4426s 18s/step - loss: 0.7094 - acc: 0.8093 - top3_acc: 0.9178 - recall: 0.6830 - precision: 0.9446 - f1: 0.7920 - val_loss: 0.4399 - val_acc: 0.8881 - val_top3_acc: 0.9567 - val_recall: 0.8140 - val_precision: 0.9606 - val_f1: 0.8808
Epoch 4/1200
 18/248 [=>............................] - ETA: 3:14:16 - loss: 0.6452 - acc: 0.8338 - top3_acc: 0.9223 - recall: 0.7257 - precision: 0.9536 - f1: 0.8240
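
As a rough cross-check of the log above, the per-step times can be recomputed from the epoch durations, and the epoch 4 rate can be estimated from the ETA. This is only a sketch: it assumes the ETA is the average time per step so far multiplied by the remaining steps, and the epoch durations include validation time, so the figures are approximate. The helper name eta_to_seconds is made up for illustration.

```python
# Values copied from the training log; 248 steps per epoch.
STEPS_PER_EPOCH = 248
epoch_seconds = {1: 2727, 2: 2712, 3: 4426}

for epoch, seconds in epoch_seconds.items():
    print(f"epoch {epoch}: {seconds / STEPS_PER_EPOCH:.1f} s/step")


def eta_to_seconds(eta):
    """Convert an 'H:MM:SS' ETA string to seconds."""
    hours, minutes, seconds = (int(part) for part in eta.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Epoch 4: 18 of 248 steps done, ETA 3:14:16 for the remaining steps.
remaining_steps = STEPS_PER_EPOCH - 18
epoch4_rate = eta_to_seconds("3:14:16") / remaining_steps
print(f"epoch 4 (from ETA): {epoch4_rate:.1f} s/step")
```

That works out to roughly 11.0, 10.9 and 17.8 s/step for epochs 1 to 3, and about 50.7 s/step at the start of epoch 4.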

Edit: I just ran this on a very small sample of the data (20 items per category), and the step time does not increase.

Edit 2: Model summary

Model: "functional_3"
Layer (type)                    Output Shape         Param #     Connected to                     
input_token (InputLayer)        [(None, 300)]        0                                            
masked_token (InputLayer)       multiple             0           input_token[0][0]                
tf_distil_bert_model (TFDistilB ((None, 300, 768),)  66362880    masked_token[1][0]               
tf_op_layer_strided_slice (Tens multiple             0           tf_distil_bert_model[1][0]       
efficientnetb5_input (InputLaye [(None, 456, 456, 3) 0                                            
batch_normalization (BatchNorma (None, 768)          3072        tf_op_layer_strided_slice[1][0]  
efficientnetb5 (Functional)     (None, 15, 15, 2048) 28513527    efficientnetb5_input[0][0]       
dense (Dense)                   (None, 256)          196864      batch_normalization[1][0]        
global_average_pooling2d (Globa (None, 2048)         0           efficientnetb5[1][0]             
dense_1 (Dense)                 (None, 140)          35980       dense[1][0]                      
dense_3 (Dense)                 (None, 140)          286860      global_average_pooling2d[1][0]   
concatenate (Concatenate)       (None, 280)          0           dense_1[1][0]                    
dense_4 (Dense)                 (None, 100)          28100       concatenate[0][0]                
dropout_20 (Dropout)            (None, 100)          0           dense_4[0][0]                    
dense_5 (Dense)                 (None, 20)           2020        dropout_20[0][0]                 
Total params: 95,429,303
Trainable params: 30,120
Non-trainable params: 95,399,183



This looks like a memory issue, possibly caused by a leak. You are able to run the model with a constant time per step on a small sample of the data, but with the complete data the epoch times increase progressively (and so does the time per step). My assumption is that as the process runs out of memory, the limited resources cause the epochs to slow down. After some web searching, it seems that others who have had memory leaks in Keras report similar symptoms in their epoch times.

For example, see the post titled "Running out of memory when training Keras LSTM model for binary classification on image sequences":

Using TensorFlow backend.
Epoch 1/60
1/1 [==============================] - 16s 16s/step - loss: 0.7258 - acc: 0.5400 - val_loss: 0.7119 - val_acc: 0.6200
Epoch 2/60
1/1 [==============================] - 18s 18s/step - loss: 0.7301 - acc: 0.4800 - val_loss: 0.7445 - val_acc: 0.4000
Epoch 3/60
1/1 [==============================] - 21s 21s/step - loss: 0.7312 - acc: 0.4200 - val_loss: 0.7411 - val_acc: 0.4200
(continues...)

Notice the progressively increasing epoch times?


One way to check your memory usage (without using TensorBoard) is with a callback. Here is some dummy code that reports the memory usage after each epoch.

import numpy as np
import tensorflow.keras as keras
import resource  # Unix-only standard library module

class MemoryCallback(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Peak resident set size of this process so far
        # (kilobytes on Linux, bytes on macOS)
        print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

def build_model(shape):
    f_input = keras.layers.Input(shape=(shape[1],))  # (100,)
    d1 = keras.layers.Dense(50, activation='tanh')(f_input)
    d1 = keras.layers.Dense(50, activation='tanh')(d1)
    softmax = keras.layers.Dense(10, activation='softmax')(d1)
    return keras.Model(f_input, softmax)

data = np.random.random((1000, 100))
labels = np.random.random((1000, 10))  # dummy targets matching the 10-unit output
model = build_model(data.shape)
model.compile(loss='mse', optimizer='SGD')
model.fit(x=data, y=labels, verbose=0, epochs=10, callbacks=[MemoryCallback()])

Make sure to set verbose=0 here. Restart your kernel / Python IDE before running this so that memory is cleared before checking. My assumption is that you will see the reported memory usage increase progressively.
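
The resource module imported above reports memory through getrusage(...).ru_maxrss, which is in kilobytes on Linux but in bytes on macOS, and is a peak value that never goes down. A minimal sketch of a helper that normalizes the reading to MiB and summarizes growth between epochs could look like this; the names peak_rss_mib and growth_per_epoch are hypothetical, not part of Keras or the standard library.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def growth_per_epoch(readings):
    """Differences between consecutive per-epoch readings."""
    return [later - earlier for earlier, later in zip(readings, readings[1:])]


if __name__ == "__main__":
    print(f"peak RSS so far: {peak_rss_mib():.1f} MiB")
```

Inside the callback, you could append peak_rss_mib() to a list in on_epoch_end and pass that list to growth_per_epoch after training. Steady positive values across many epochs would be consistent with a leak, while values that drop to zero after the first few epochs would not.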

If you are on Mac or Linux, you can also use htop to watch your memory usage while the model is running. You should see the memory climb until it hits the maximum. You can install htop with brew install htop on Mac or sudo apt-get install htop on Linux.



A possible solution, suggested in another Stack Overflow post, is to limit the memory that TensorFlow uses.

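import tensorflow as tf

# TF 1.x-style session config; on TF 2.x it is reachable through tf.compat.v1.
# Run this before any other TensorFlow GPU work.
config = tf.compat.v1.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9  # fraction of GPU memory TF may use
config.gpu_options.visible_device_list = "0"  # only expose GPU 0

# Make Keras use a session created with this config
tf.compat.v1.keras.backend.set_session(tf.compat.v1.Session(config=config))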

Answered By: Akshay Sehgal