Training gets progressively slower over time

Question:

This is the first time I’m running into this issue. I’ve been using this model for a while, but with less data. The problem is that the first two epochs ran at 11 sec/step (31k samples, batch size 128), the third jumped to 18 sec/step, and the fourth is tracking at roughly 45 sec/step. I’m using Keras and not doing anything custom in the training loop.

Can someone explain this slowdown? Training hasn’t been interrupted. I’m using TF 2.3.

Epoch 1/1200
248/248 [==============================] - 2727s 11s/step - loss: 2.3481 - acc: 0.3818 - top3_acc: 0.5751 - recall: 0.2228 - precision: 0.6195 - f1: 0.3239 - val_loss: 0.9020 - val_acc: 0.8085 - val_top3_acc: 0.8956 - val_recall: 0.5677 - val_precision: 0.9793 - val_f1: 0.7179
Epoch 2/1200
248/248 [==============================] - 2712s 11s/step - loss: 1.0319 - acc: 0.7203 - top3_acc: 0.8615 - recall: 0.5489 - precision: 0.9245 - f1: 0.6865 - val_loss: 0.5547 - val_acc: 0.8708 - val_top3_acc: 0.9371 - val_recall: 0.7491 - val_precision: 0.9661 - val_f1: 0.8435
Epoch 3/1200
248/248 [==============================] - 4426s 18s/step - loss: 0.7094 - acc: 0.8093 - top3_acc: 0.9178 - recall: 0.6830 - precision: 0.9446 - f1: 0.7920 - val_loss: 0.4399 - val_acc: 0.8881 - val_top3_acc: 0.9567 - val_recall: 0.8140 - val_precision: 0.9606 - val_f1: 0.8808
Epoch 4/1200
 18/248 [=>............................] - ETA: 3:14:16 - loss: 0.6452 - acc: 0.8338 - top3_acc: 0.9223 - recall: 0.7257 - precision: 0.9536 - f1: 0.8240

Edit: I just ran this on a very small sample (20 items per category) of the data and the step time does not increase.

Edit 2: Model summary

Model: "functional_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_token (InputLayer)        [(None, 300)]        0                                            
__________________________________________________________________________________________________
masked_token (InputLayer)       multiple             0           input_token[0][0]                
__________________________________________________________________________________________________
tf_distil_bert_model (TFDistilB ((None, 300, 768),)  66362880    masked_token[1][0]               
__________________________________________________________________________________________________
tf_op_layer_strided_slice (Tens multiple             0           tf_distil_bert_model[1][0]       
__________________________________________________________________________________________________
efficientnetb5_input (InputLaye [(None, 456, 456, 3) 0                                            
__________________________________________________________________________________________________
batch_normalization (BatchNorma (None, 768)          3072        tf_op_layer_strided_slice[1][0]  
__________________________________________________________________________________________________
efficientnetb5 (Functional)     (None, 15, 15, 2048) 28513527    efficientnetb5_input[0][0]       
__________________________________________________________________________________________________
dense (Dense)                   (None, 256)          196864      batch_normalization[1][0]        
__________________________________________________________________________________________________
global_average_pooling2d (Globa (None, 2048)         0           efficientnetb5[1][0]             
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 140)          35980       dense[1][0]                      
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 140)          286860      global_average_pooling2d[1][0]   
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 280)          0           dense_1[1][0]                    
                                                                 dense_3[1][0]                    
__________________________________________________________________________________________________
dense_4 (Dense)                 (None, 100)          28100       concatenate[0][0]                
__________________________________________________________________________________________________
dropout_20 (Dropout)            (None, 100)          0           dense_4[0][0]                    
__________________________________________________________________________________________________
dense_5 (Dense)                 (None, 20)           2020        dropout_20[0][0]                 
==================================================================================================
Total params: 95,429,303
Trainable params: 30,120
Non-trainable params: 95,399,183
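
For context, the summary above corresponds to a two-branch model: a frozen DistilBERT text encoder whose [CLS] token embedding goes through batch normalization and a couple of Dense layers, and a frozen EfficientNetB5 image encoder that is global-average-pooled and projected, with the two branches concatenated into a small trainable head. A rough reconstruction might look like the sketch below; it is a guess only – layer names, activations, and the exact freezing scheme are inferred from the parameter counts, and the Hugging Face transformers library is assumed for the DistilBERT layer.

import tensorflow as tf
from tensorflow.keras import layers
from transformers import TFDistilBertModel  # assumption: this is where the DistilBERT layer comes from

# --- text branch ---
input_token = layers.Input(shape=(300,), dtype=tf.int32, name="input_token")
masked_token = layers.Input(shape=(300,), dtype=tf.int32, name="masked_token")

bert = TFDistilBertModel.from_pretrained("distilbert-base-uncased")
bert.trainable = False                                  # ~66M non-trainable params, as in the summary
seq_output = bert(input_token, attention_mask=masked_token)[0]   # (None, 300, 768)
cls_embedding = seq_output[:, 0]                        # the strided slice -> (None, 768)
x_text = layers.BatchNormalization()(cls_embedding)
x_text = layers.Dense(256, activation="relu")(x_text)
x_text = layers.Dense(140, activation="relu")(x_text)

# --- image branch ---
cnn = tf.keras.applications.EfficientNetB5(
    include_top=False, weights="imagenet", input_shape=(456, 456, 3))
cnn.trainable = False
x_img = layers.GlobalAveragePooling2D()(cnn.output)     # (None, 2048)
x_img = layers.Dense(140, activation="relu")(x_img)

# --- fusion head (per the trainable-parameter count, only the last two Dense layers train) ---
x = layers.Concatenate()([x_text, x_img])
x = layers.Dense(100, activation="relu")(x)
x = layers.Dropout(0.5)(x)
output = layers.Dense(20, activation="softmax")(x)

model = tf.keras.Model(inputs=[input_token, masked_token, cnn.input], outputs=output)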

Answers:

Symptoms:

This seems to be a memory issue caused by a leak. First, you were able to run the model with constant epoch times on a small sample, BUT with the complete data the epoch times increase progressively (with increasing time/step too!). My assumption is that as you run out of memory, the limited resources drive the epoch times up. After some web searching, it seems that others who have had memory leaks in Keras report similar ‘symptoms’ with respect to epoch times.

Check this post, for example – "Running out of memory when training Keras LSTM model for binary classification on image sequences":

Using TensorFlow backend.
Epoch 1/60
1/1 [==============================] - 16s 16s/step - loss: 0.7258 - acc: 0.5400 - val_loss: 0.7119 - val_acc: 0.6200
Epoch 2/60
1/1 [==============================] - 18s 18s/step - loss: 0.7301 - acc: 0.4800 - val_loss: 0.7445 - val_acc: 0.4000
Epoch 3/60
1/1 [==============================] - 21s 21s/step - loss: 0.7312 - acc: 0.4200 - val_loss: 0.7411 - val_acc: 0.4200
(...training continues...)

Notice the progressively increasing epoch times?
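
If you want to confirm that the per-step time itself is drifting upward (rather than, say, the validation pass getting slower), a simple timing callback can log how long each epoch and each batch takes. This is a minimal sketch using only standard Keras callback hooks; the log format is my own.

import time
import tensorflow.keras as keras

class TimingCallback(keras.callbacks.Callback):
    """Logs epoch duration and average seconds/step to spot a progressive slowdown."""

    def on_epoch_begin(self, epoch, logs=None):
        self._epoch_start = time.time()
        self._batch_times = []

    def on_train_batch_begin(self, batch, logs=None):
        self._batch_start = time.time()

    def on_train_batch_end(self, batch, logs=None):
        self._batch_times.append(time.time() - self._batch_start)

    def on_epoch_end(self, epoch, logs=None):
        epoch_time = time.time() - self._epoch_start
        avg_step = sum(self._batch_times) / max(len(self._batch_times), 1)
        print(f"epoch {epoch}: {epoch_time:.1f}s total, {avg_step:.2f}s/step on average")

Attach it with model.fit(..., callbacks=[TimingCallback()]), alongside the memory callback below.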


Diagnosis:

One way you can check your memory usage (without using TensorBoard) is with callbacks. Here is some dummy code that should help you log the memory usage after each epoch via a callback.

import numpy as np
import tensorflow.keras as keras
import resource  # Unix-only; exposes the process's peak resident set size

class MemoryCallback(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # ru_maxrss = peak memory used by this process so far
        print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

def build_model(shape):
    f_input = keras.layers.Input(shape=(shape[1],))  # (100,)
    d1 = keras.layers.Dense(50, activation='tanh')(f_input)
    d1 = keras.layers.Dense(50, activation='tanh')(d1)
    softmax = keras.layers.Dense(10, activation='softmax')(d1)
    return keras.Model(f_input, softmax)

data = np.random.random((1000, 100))
model = build_model(data.shape)
model.compile(loss='mse', optimizer='SGD')
model.fit(x=data, y=np.random.random((1000, 10)), verbose=0, epochs=10,
          callbacks=[MemoryCallback()])

Output (ru_maxrss after each epoch):
312844288
312897536
312909824
312918016
312918016
312926208
312930304
312934400
312938496
312950784

Make sure to set verbose=0 here, and restart your kernel / Python IDE before running it so that memory is cleared before checking. If there is a leak, you should see this number increase progressively from epoch to epoch.

If you are on Mac or Linux, you can also use htop to watch your memory usage while the model is running; you should see it climb toward the maximum. You can install it with brew install htop on Mac or sudo apt-get install htop on Linux.



Solution:

A solution to this problem is suggested in this Stack Overflow post; it limits the memory that TF is allowed to grab up front. Note that the snippet below uses the TF 1.x / standalone-Keras API:

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9 # fraction of memory
config.gpu_options.visible_device_list = "0"

set_session(tf.Session(config=config))
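
Since you are on TF 2.3, the equivalent is done through tf.config rather than a session config. A hedged sketch follows (the memory_limit value is arbitrary; adjust it to your GPU):

import tensorflow as tf

# Run this before building the model / calling fit.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Option 1: let TF allocate GPU memory on demand instead of grabbing it all up front.
    tf.config.experimental.set_memory_growth(gpus[0], True)

    # Option 2 (alternative, do not combine with option 1): hard-cap the GPU memory TF may use, e.g. ~9 GB.
    # tf.config.experimental.set_virtual_device_configuration(
    #     gpus[0],
    #     [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=9216)])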
Answered By: Akshay Sehgal