Keras: How to save model and continue training?

Question:

I have a model that I’ve trained for 40 epochs. I kept a checkpoint for each epoch, and I also saved the model with model.save(). The code for training is:

from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense
from keras.callbacks import ModelCheckpoint

n_units = 1000
model = Sequential()
model.add(LSTM(n_units, input_shape=(None, vec_size), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units))
model.add(Dropout(0.2))
model.add(Dense(vec_size, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')
# define the checkpoint
filepath="word2vec-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model.fit(x, y, epochs=40, batch_size=50, callbacks=callbacks_list)

However, when I load the model and try to train it again, it starts all over as if it hadn’t been trained before. The loss doesn’t continue from where the last training left off.

What confuses me is that when I load the model, redefine the model structure, and use load_weights(), model.predict() works well. Thus, I believe the model weights are loaded:

model = Sequential()
model.add(LSTM(n_units, input_shape=(None, vec_size), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units))
model.add(Dropout(0.2))
model.add(Dense(vec_size, activation='linear'))
filename = "word2vec-39-0.0027.hdf5"
model.load_weights(filename)
model.compile(loss='mean_squared_error', optimizer='adam')

However, when I continue training with this, the loss is as high as at the initial stage:

filepath="word2vec-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model.fit(x, y, epochs=40, batch_size=50, callbacks=callbacks_list)

I searched and found some examples of saving and loading models here and here. However, none of them work.


Update 1

I looked at this question, tried it and it works:

model.save('partly_trained.h5')
del model
model = load_model('partly_trained.h5')

But when I close Python, reopen it, and run load_model again, it fails. The loss is as high as in the initial state.


Update 2

I tried Yu-Yang’s example code and it works. However, when I use my own code again, it still fails.

This is the result from the original training. The second epoch should start with loss = 3.1***:

13700/13846 [============================>.] - ETA: 0s - loss: 3.0519
13750/13846 [============================>.] - ETA: 0s - loss: 3.0511
13800/13846 [============================>.] - ETA: 0s - loss: 3.0512Epoch 00000: loss improved from inf to 3.05101, saving model to LPT-00-3.0510.h5

13846/13846 [==============================] - 81s - loss: 3.0510    
Epoch 2/60

   50/13846 [..............................] - ETA: 80s - loss: 3.1754
  100/13846 [..............................] - ETA: 78s - loss: 3.1174
  150/13846 [..............................] - ETA: 78s - loss: 3.0745

I closed Python, reopened it, loaded the model with model = load_model("LPT-00-3.0510.h5"), and then trained with:

filepath="LPT-{epoch:02d}-{loss:.4f}.h5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model.fit(x, y, epochs=60, batch_size=50, callbacks=callbacks_list)

The loss starts at 4.54:

Epoch 1/60
   50/13846 [..............................] - ETA: 162s - loss: 4.5451
   100/13846 [..............................] - ETA: 113s - loss: 4.3835
Asked By: David


Answers:

As it’s quite difficult to tell where the problem is, I created a toy example based on your code, and it seems to work fine.

import numpy as np
from numpy.testing import assert_allclose
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dropout, Dense
from keras.callbacks import ModelCheckpoint

vec_size = 100
n_units = 10

x_train = np.random.rand(500, 10, vec_size)
y_train = np.random.rand(500, vec_size)

model = Sequential()
model.add(LSTM(n_units, input_shape=(None, vec_size), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units))
model.add(Dropout(0.2))
model.add(Dense(vec_size, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')

# define the checkpoint
filepath = "model.h5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

# fit the model
model.fit(x_train, y_train, epochs=5, batch_size=50, callbacks=callbacks_list)

# load the model
new_model = load_model(filepath)
assert_allclose(model.predict(x_train),
                new_model.predict(x_train),
                1e-5)

# fit the model
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
new_model.fit(x_train, y_train, epochs=5, batch_size=50, callbacks=callbacks_list)

The loss continues to decrease after the model is loaded. (Restarting Python also causes no problem.)

Using TensorFlow backend.
Epoch 1/5
500/500 [==============================] - 2s - loss: 0.3216     Epoch 00000: loss improved from inf to 0.32163, saving model to model.h5
Epoch 2/5
500/500 [==============================] - 0s - loss: 0.2923     Epoch 00001: loss improved from 0.32163 to 0.29234, saving model to model.h5
Epoch 3/5
500/500 [==============================] - 0s - loss: 0.2542     Epoch 00002: loss improved from 0.29234 to 0.25415, saving model to model.h5
Epoch 4/5
500/500 [==============================] - 0s - loss: 0.2086     Epoch 00003: loss improved from 0.25415 to 0.20860, saving model to model.h5
Epoch 5/5
500/500 [==============================] - 0s - loss: 0.1725     Epoch 00004: loss improved from 0.20860 to 0.17249, saving model to model.h5

Epoch 1/5
500/500 [==============================] - 0s - loss: 0.1454     Epoch 00000: loss improved from inf to 0.14543, saving model to model.h5
Epoch 2/5
500/500 [==============================] - 0s - loss: 0.1289     Epoch 00001: loss improved from 0.14543 to 0.12892, saving model to model.h5
Epoch 3/5
500/500 [==============================] - 0s - loss: 0.1169     Epoch 00002: loss improved from 0.12892 to 0.11694, saving model to model.h5
Epoch 4/5
500/500 [==============================] - 0s - loss: 0.1097     Epoch 00003: loss improved from 0.11694 to 0.10971, saving model to model.h5
Epoch 5/5
500/500 [==============================] - 0s - loss: 0.1057     Epoch 00004: loss improved from 0.10971 to 0.10570, saving model to model.h5

BTW, redefining the model followed by load_weights() definitely won’t work, because save_weights() and load_weights() do not save/load the optimizer state.
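
To illustrate the distinction, here is a minimal sketch (file and variable names are illustrative, reusing x_train/y_train from the toy example above): model.save()/load_model() restores the compiled model together with the optimizer state, while save_weights()/load_weights() restores only the weights, so the optimizer (e.g. Adam’s moment estimates) starts from scratch.

from keras.models import load_model

# Full-model save: architecture + weights + compile info + optimizer state.
model.save('full_model.h5')
resumed = load_model('full_model.h5')                    # ready to fit(); optimizer resumes
resumed.fit(x_train, y_train, epochs=5, batch_size=50)

# Weights-only save: the optimizer state is lost.
model.save_weights('weights_only.h5')
# rebuilt_model.load_weights('weights_only.h5')   # rebuilt_model: same architecture, redefined by hand
# rebuilt_model.compile(...)                      # fresh optimizer -> its state starts from zero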

Answered By: Yu-Yang

I compared my code with this example http://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/
by carefully blocking out lines one by one and rerunning. After a whole day, I finally found what was wrong.

When making the char-to-int mapping, I used

# title_str_reduced is a string
chars = list(set(title_str_reduced))
# make char to int index mapping
char2int = {}
for i in range(len(chars)):
    char2int[chars[i]] = i    

A set is an unordered data structure. In Python, when a set is converted to a list (which is ordered), the resulting order is arbitrary and can differ between runs. Thus my char2int dictionary was randomized every time I reopened Python.
I fixed my code by adding sorted():

chars = sorted(list(set(title_str_reduced)))

This forces the characters into a fixed order, so the mapping is identical in every session.
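
For reference, a minimal sketch of the deterministic mapping (int2char is an extra helper added here for illustration):

# Deterministic char-to-int mapping: sorted() fixes the order across
# Python sessions, so the indices stay stable between training runs.
chars = sorted(set(title_str_reduced))
char2int = {c: i for i, c in enumerate(chars)}
int2char = {i: c for i, c in enumerate(chars)}  # illustrative inverse mapping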

Answered By: David

Here is the official Keras documentation on how to save a model:

https://keras.io/getting-started/faq/#how-can-i-save-a-keras-model

In this post the author provides two examples of saving and loading your model to a file (a minimal sketch of the JSON route is shown after the list):

  • JSON format.
  • YAML format.
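
A minimal sketch of the JSON route, assuming model is an already built Keras model (file names are illustrative). Note that the JSON config holds only the architecture, so the weights must be saved separately and the optimizer state is not preserved:

from keras.models import model_from_json

# Save the architecture as JSON and the weights as HDF5.
json_config = model.to_json()
with open("model.json", "w") as f:
    f.write(json_config)
model.save_weights("model_weights.h5")

# Later, or in a new session: rebuild, reload weights, and recompile.
with open("model.json") as f:
    restored = model_from_json(f.read())
restored.load_weights("model_weights.h5")
restored.compile(loss='mean_squared_error', optimizer='adam')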
Answered By: a11apurva

I think you can write

model.save('partly_trained.h5')

and

model = load_model('partly_trained.h5')

instead of

model = Sequential()
model.add(LSTM(n_units, input_shape=(None, vec_size), return_sequences=True))    
model.add(Dropout(0.2)) 
model.add(LSTM(n_units, return_sequences=True))  
model.add(Dropout(0.2)) 
model.add(LSTM(n_units))
model.add(Dropout(0.2))
model.add(Dense(vec_size, activation='linear')) 
model.compile(loss='mean_squared_error', optimizer='adam')

Then continue training.
This works because model.save stores both the architecture and the weights, as you can read in the documentation.
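
A minimal sketch of that round trip, assuming x and y are the training arrays from the question (the file name matches the one above):

from keras.models import load_model

model.save('partly_trained.h5')            # architecture + weights + optimizer state
del model                                  # nothing of the old model survives in memory

model = load_model('partly_trained.h5')
model.fit(x, y, epochs=10, batch_size=50)  # training resumes from the saved state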

Answered By: bruce

Assume you have code like this:

model = some_model_you_made(input_img) # you compiled your model in this 
model.summary()

model_checkpoint = ModelCheckpoint('yours.h5', monitor='val_loss', verbose=1, save_best_only=True)

model_json = model.to_json()
with open("yours.json", "w") as json_file:
    json_file.write(model_json)

model.fit_generator(#stuff...) # or model.fit(#stuff...)

Now turn your code into this:

model = some_model_you_made(input_img) #same model here
model.summary()

model_checkpoint = ModelCheckpoint('yours.h5', monitor='val_loss', verbose=1, save_best_only=True) # same checkpoint

model_json = model.to_json()
with open("yours.json", "w") as json_file:
    json_file.write(model_json)

from keras.models import model_from_json  # needed to rebuild the model from its JSON config

with open('yours.json', 'r') as f:
    old_model = model_from_json(f.read()) # open the model you just saved (same as your last train) with a different name

old_model.load_weights('yours.h5') # the model checkpoint you trained before
old_model.compile(#stuff...) # need to compile again (exactly like the last compile)

# now start training with the checkpoint...
old_model.fit_generator(#same stuff like the last train) # or model.fit(#stuff...)
Answered By: MeiH

The above answer uses TensorFlow 1.x. Here is an updated version using TensorFlow 2.x.

import numpy as np
from numpy.testing import assert_allclose
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.callbacks import ModelCheckpoint

vec_size = 100
n_units = 10

x_train = np.random.rand(500, 10, vec_size)
y_train = np.random.rand(500, vec_size)

model = Sequential()
model.add(LSTM(n_units, input_shape=(None, vec_size), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units))
model.add(Dropout(0.2))
model.add(Dense(vec_size, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')

# define the checkpoint
filepath = "model.h5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

# fit the model
model.fit(x_train, y_train, epochs=5, batch_size=50, callbacks=callbacks_list)

# load the model
new_model = load_model("model.h5")
assert_allclose(model.predict(x_train),
                new_model.predict(x_train),
                1e-5)

# fit the model
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
new_model.fit(x_train, y_train, epochs=5, batch_size=50, callbacks=callbacks_list)
Answered By: Mrinal Jain

The checkmarked answer is not correct; the real problem is more subtle.

When you create a ModelCheckpoint(), check its best attribute:

cp1 = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
print(cp1.best)

You will see that it is set to np.inf, which unfortunately is not the best loss you had when you stopped training. So when you re-train and recreate the ModelCheckpoint(), calling fit will seem to work as long as the loss is lower than the previously known value, but in more complex problems you can end up saving a bad model and losing the best one.

You can fix this by overwriting the checkpoint’s best attribute as shown below:

import numpy as np
from numpy.testing import assert_allclose
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dropout, Dense
from keras.callbacks import ModelCheckpoint

vec_size = 100
n_units = 10

x_train = np.random.rand(500, 10, vec_size)
y_train = np.random.rand(500, vec_size)

model = Sequential()
model.add(LSTM(n_units, input_shape=(None, vec_size), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units))
model.add(Dropout(0.2))
model.add(Dense(vec_size, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')

# define the checkpoint
filepath = "model.h5"
cp1 = ModelCheckpoint(filepath=filepath, monitor='loss', save_best_only=True, verbose=1, mode='min')
callbacks_list = [cp1]

# fit the model
model.fit(x_train, y_train, epochs=5, batch_size=50, shuffle=True, validation_split=0.1, callbacks=callbacks_list)

# load the model
new_model = load_model(filepath)
#assert_allclose(model.predict(x_train),new_model.predict(x_train), 1e-5)
score = model.evaluate(x_train, y_train, batch_size=50)
cp1 = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
cp1.best = score # <== ****THIS IS THE KEY**** See the source for ModelCheckpoint

# fit the model
callbacks_list = [cp1]
new_model.fit(x_train, y_train, epochs=5, batch_size=50, callbacks=callbacks_list)
Answered By: user30012

Since Keras and TensorFlow are now bundled, you can use the newer TensorFlow SavedModel format, which saves all model info including the optimizer and its state (from the doc, emphasis mine):

You can save an entire model to a single artifact. It will include:

  • The model’s architecture/config
  • The model’s weight values (which were learned during training)
  • The model’s compilation information (if compile() was called)
  • The optimizer and its state, if any (this enables you to restart training where you left)

The relevant APIs are model.save() (or tf.keras.models.save_model()) and tf.keras.models.load_model().

So once your model is saved that way, you can load it and resume training: it will continue where it left off.
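
A minimal sketch, assuming TensorFlow 2.x and reusing x_train/y_train from the earlier examples (the directory name is illustrative):

import tensorflow as tf

# Saving to a path without an .h5 extension uses the TensorFlow SavedModel
# format: a directory containing the architecture, weights, compile info,
# and the optimizer state.
model.save("saved_model_dir")

# Later, possibly in a new Python session:
restored = tf.keras.models.load_model("saved_model_dir")
restored.fit(x_train, y_train, epochs=5, batch_size=50)  # resumes where it left off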

Answered By: Matthieu