Why is my loss function increasing with each epoch?

Question:

I’m new to ML, so I’m sorry if this is some stupid question anyone could have figured out. I am using TensorFlow and Keras here.

So here’s my code:

import tensorflow as tf
import numpy as np
from tensorflow import keras
model = keras.Sequential([
    keras.layers.Dense(units=1, input_shape=[1])
])
model.compile(optimizer="sgd", loss="mean_squared_error")
xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0], dtype=float)
ys = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0], dtype=float)
model.fit(xs, ys, epochs=500)
print(model.predict([25.0]))

I get this as output (I’m not showing all 500 epochs, just the first 20):

Epoch 1/500
1/1 [==============================] - 0s 210ms/step - loss: 450.9794
Epoch 2/500
1/1 [==============================] - 0s 4ms/step - loss: 1603.0852
Epoch 3/500
1/1 [==============================] - 0s 10ms/step - loss: 5698.4731
Epoch 4/500
1/1 [==============================] - 0s 7ms/step - loss: 20256.3398
Epoch 5/500
1/1 [==============================] - 0s 10ms/step - loss: 72005.1719
Epoch 6/500
1/1 [==============================] - 0s 4ms/step - loss: 255956.5938
Epoch 7/500
1/1 [==============================] - 0s 3ms/step - loss: 909848.5000
Epoch 8/500
1/1 [==============================] - 0s 5ms/step - loss: 3234236.0000
Epoch 9/500
1/1 [==============================] - 0s 3ms/step - loss: 11496730.0000
Epoch 10/500
1/1 [==============================] - 0s 3ms/step - loss: 40867392.0000
Epoch 11/500
1/1 [==============================] - 0s 3ms/step - loss: 145271264.0000
Epoch 12/500
1/1 [==============================] - 0s 3ms/step - loss: 516395584.0000
Epoch 13/500
1/1 [==============================] - 0s 4ms/step - loss: 1835629312.0000
Epoch 14/500
1/1 [==============================] - 0s 3ms/step - loss: 6525110272.0000
Epoch 15/500
1/1 [==============================] - 0s 3ms/step - loss: 23194802176.0000
Epoch 16/500
1/1 [==============================] - 0s 3ms/step - loss: 82450513920.0000
Epoch 17/500
1/1 [==============================] - 0s 3ms/step - loss: 293086593024.0000
Epoch 18/500
1/1 [==============================] - 0s 5ms/step - loss: 1041834835968.0000
Epoch 19/500
1/1 [==============================] - 0s 3ms/step - loss: 3703408164864.0000
Epoch 20/500
1/1 [==============================] - 0s 3ms/step - loss: 13164500484096.0000

As you can see, it is increasing exponentially. Soon (at the 64th epoch), these numbers become inf. And then, from infinity, it does something and becomes NaN (Not a Number). I thought a model would get better at figuring out patterns over time. What is going on?

One thing I noticed: if I reduce the length of xs and ys from 20 to 10, the loss decreases and reaches 7.9193e-05. Once I increase the length of both NumPy arrays to 18, the loss starts increasing uncontrollably; below that it’s fine. I gave 20 values because I thought the model would be better with more data.

Asked By: Robo

Answers:

It seems that the SGD optimizer doesn’t perform well on your dataset.
If you replace the optimizer with ‘adam’, you should get the result you expected.

model.compile(optimizer="adam", loss="mean_squared_error")

The prediction should then be what you would expect:

print(model.predict([25.0]))
# [[12.487587]]

I am not 100% sure why the SGD optimizer works so badly.

EDIT:

@MortenJensen (below) provides a good explanation as to why the Adam optimizer does better.
Summary: the reason SGD doesn’t do well is that it needs a smaller learning rate. Adam, however, has an adaptive learning rate.
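
To make the "adaptive learning rate" point a bit more concrete, here is a minimal pure-Python sketch of the Adam update rule as described in the Adam paper, applied to the data from the question. It is an illustration only, not Keras’s actual implementation: the helper name adam_fit and the zero starting weights are assumptions made for this example, and the hyperparameter defaults are chosen to mirror Keras’s documented Adam defaults.

```python
import math

# Data copied from the question: y = 0.5 * x for x = 1..20.
xs = [float(i) for i in range(1, 21)]
ys = [0.5 * x for x in xs]

def adam_fit(steps, learning_rate=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """Fit y = w*x + b with Adam; returns ([w, b], loss recorded before each step)."""
    params = [0.0, 0.0]   # [w, b]; starting at zero is an assumption
    m = [0.0, 0.0]        # running mean of gradients (first moment)
    v = [0.0, 0.0]        # running mean of squared gradients (second moment)
    n = len(xs)
    losses = []
    for t in range(1, steps + 1):
        errors = [params[0] * x + params[1] - y for x, y in zip(xs, ys)]
        losses.append(sum(e * e for e in errors) / n)
        grads = [2.0 / n * sum(e * x for e, x in zip(errors, xs)),
                 2.0 / n * sum(errors)]
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)   # bias correction
            v_hat = v[i] / (1 - beta2 ** t)
            # The step is scaled by the gradient's own magnitude, so it stays
            # close to learning_rate however large the raw gradient is.
            params[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return params, losses

params, losses = adam_fit(steps=200)
print("loss:", losses[0], "->", losses[-1])
```

From a zero start the gradient with respect to w is -143.5 on this data, so plain SGD with learning_rate=0.01 would move w by 1.435 in its first step, far past the target of 0.5. Adam’s first step moves w by only about 0.001, because the gradient is divided by an estimate of its own size.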

Answered By: Dominik Sajovic

Your alpha/learning-rate seems to be too big.

Try with a lower learning-rate, like so:

import tensorflow as tf
import numpy as np
from tensorflow import keras
model = keras.Sequential([
    keras.layers.Dense(units=1, input_shape=[1])
])
# Set the optimizer explicitly; SGD's default learning_rate is 0.01
opt = keras.optimizers.SGD(learning_rate=0.0001)

model.compile(optimizer=opt, loss="mean_squared_error")
xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0], dtype=float)
ys = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0], dtype=float)
model.fit(xs, ys, epochs=500)
print(model.predict([25.0]))

… which will converge.

One of the reasons Adam works better is probably that it estimates the learning rate adaptively. I think the A in Adam stands for Adaptive ;)

EDIT: It does!

From https://arxiv.org/pdf/1412.6980.pdf:

"The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients; the name Adam is derived from adaptive moment estimation."

Output of the SGD run above with learning_rate=0.0001:

Epoch 1/500
1/1 [==============================] - 0s 129ms/step - loss: 1.2133
Epoch 2/500
1/1 [==============================] - 0s 990us/step - loss: 1.1442
Epoch 3/500
1/1 [==============================] - 0s 0s/step - loss: 1.0792
Epoch 4/500
1/1 [==============================] - 0s 1ms/step - loss: 1.0178
Epoch 5/500
1/1 [==============================] - 0s 1ms/step - loss: 0.9599
Epoch 6/500
1/1 [==============================] - 0s 1ms/step - loss: 0.9053
Epoch 7/500
1/1 [==============================] - 0s 0s/step - loss: 0.8538
Epoch 8/500
1/1 [==============================] - 0s 1ms/step - loss: 0.8053
Epoch 9/500
1/1 [==============================] - 0s 999us/step - loss: 0.7595
Epoch 10/500
1/1 [==============================] - 0s 1ms/step - loss: 0.7163
...
Epoch 499/500
1/1 [==============================] - 0s 1ms/step - loss: 9.9431e-06
Epoch 500/500
1/1 [==============================] - 0s 999us/step - loss: 9.9420e-06

EDIT2:

With true/"vanilla" gradient descent (as opposed to stochastic GD), you should see the loss decrease at every step, provided the step size is small enough. If it starts to diverge, it’s usually because the alpha/learning-rate/step-size is too big, which means the search "overshoots" in one, several or all dimensions.
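
The overshooting can be reproduced without Keras. The following is a minimal sketch of plain full-batch gradient descent on the question’s data in pure Python. The helper run_gd and the zero starting weights are assumptions for illustration, so the exact numbers will not match the Keras logs, which start from random weights. Full-batch is a fair comparison here: Keras’s default batch size is 32, so all 20 samples fall into a single batch, which is why the logs show 1/1 steps per epoch.

```python
# Plain full-batch gradient descent on y = w*x + b with mean squared error.
# Data copied from the question; run_gd is a hypothetical helper.
xs = [float(i) for i in range(1, 21)]   # 1.0 .. 20.0
ys = [0.5 * x for x in xs]              # 0.5 .. 10.0

def run_gd(learning_rate, steps, w=0.0, b=0.0):
    """Return the MSE recorded before each of `steps` updates."""
    n = len(xs)
    losses = []
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        losses.append(sum(e * e for e in errors) / n)
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return losses

diverging = run_gd(learning_rate=0.01, steps=20)     # Keras SGD default
converging = run_gd(learning_rate=0.0001, steps=20)  # value suggested above
print("lr=0.01  :", diverging[0], "->", diverging[-1])
print("lr=0.0001:", converging[0], "->", converging[-1])
```

With the larger step size each update lands further from the minimum than the previous one, so the loss grows by a roughly constant factor per step, which is the exponential blow-up seen in the question.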

Consider a loss function whose surface has a very narrow valley in one or several dimensions. Even a "small step too far" can suddenly produce a large error.
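
For this particular problem the "narrowness" of the valley can be quantified. The mean squared error of a linear model is a quadratic function of (w, b), and for a quadratic, gradient descent without momentum stays stable only while the learning rate is below 2 divided by the largest eigenvalue of the Hessian. The sketch below computes that bound in pure Python. The function name max_stable_learning_rate is made up for this example, and it assumes full-batch updates on unscaled inputs.

```python
import math

def max_stable_learning_rate(xs):
    """Upper bound on the step size for which full-batch gradient descent
    on the MSE of y = w*x + b does not diverge: 2 / (largest Hessian eigenvalue)."""
    n = len(xs)
    # Hessian of the MSE w.r.t. (w, b) is 2 * [[mean(x^2), mean(x)], [mean(x), 1]].
    a = 2.0 * sum(x * x for x in xs) / n
    c = 2.0 * sum(xs) / n
    d = 2.0
    # Largest eigenvalue of the symmetric 2x2 matrix [[a, c], [c, d]].
    largest = (a + d) / 2.0 + math.sqrt(((a - d) / 2.0) ** 2 + c * c)
    return 2.0 / largest

for n in (10, 20):
    xs = [float(i) for i in range(1, n + 1)]
    print(n, "points -> learning rate must stay below", max_stable_learning_rate(xs))
```

For x = 1..10 the bound comes out at roughly 0.025, so the default of 0.01 is safe. For x = 1..20 it drops to roughly 0.007, so 0.01 overshoots. Adding larger x values makes the valley steeper, which is why more data made things worse here. Rescaling the inputs to a small range would be another way to raise the bound.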

Answered By: Morten Jensen