Why does my LSTM model predict wrong values although the loss is decreasing?

Question:

I am trying to build a machine learning model which predicts a single number from a series of numbers. I am using an LSTM model with TensorFlow.

You can imagine my dataset to look something like this:

Index | x data                      | y data
------+-----------------------------+--------------
0     | np.array (shape (10000, 1)) | numpy.float32
1     | np.array (shape (10000, 1)) | numpy.float32
2     | np.array (shape (10000, 1)) | numpy.float32
...   | ...                         | ...
56    | np.array (shape (10000, 1)) | numpy.float32

Easily said I just want my model to predict a number (y data) from a sequence of numbers (x data).

For example like this:

  • array([3.59280851, 3.60459062, 3.60459062, …]) => 2.8989773
  • array([3.54752101, 3.56740332, 3.56740332, …]) => 3.0893357

x and y data

From my x data I created a numpy array x_train which I want to use to train the network.
Because I am using an LSTM network, x_train should be of shape (samples, time_steps, features).
I reshaped my x_train array to have the shape (57, 10000, 1), because I have 57 samples, each of length 10000 with a single feature per time step.

The y data was created similarly and is of shape (57,1) because, once again, I have 57 samples which each contain a single number as the desired y output.
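
For reference, building these arrays could look roughly like this (a minimal sketch; raw_x and raw_y are placeholder names for the loaded data, not identifiers from the original notebook):

import numpy as np

# raw_x: list of 57 arrays, each of shape (10000, 1); raw_y: list of 57 floats
x_train = np.stack(raw_x)                                   # -> (57, 10000, 1) = (samples, time_steps, features)
y_train = np.array(raw_y, dtype=np.float32).reshape(57, 1)  # -> (57, 1), one target per sample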

Current model attempt

My model summary looks like this:
[image: model summary]

The model was compiled with model.compile(loss="mse", optimizer="adam") so my loss function is simply the mean squared error and as an optimizer I’m using Adam.
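
As a rough sketch, an LSTM regression model of the kind described might look like this (the layer size of 64 units is an assumption, not taken from the post; the actual architecture is in the model summary above):

import tensorflow as tf

inputs = tf.keras.Input(shape=(10000, 1))      # (time_steps, features)
x = tf.keras.layers.LSTM(64)(inputs)           # 64 units is an assumption
outputs = tf.keras.layers.Dense(1)(x)          # single regression output
model = tf.keras.Model(inputs, outputs)
model.compile(loss="mse", optimizer="adam")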

Current results

Training of the model works fine and I can see that the loss and validation loss decrease after some epochs.
The actual problem occurs when I want to predict some data y_verify from some data x_verify.
I do this after the training is finished to determine how well the model is trained.
In the following example I simply used the training data to evaluate the model (I know about overfitting and that verifying with the training set is not the right way to do it, but that is not the problem I want to demonstrate right now).

In the following graph you can see the y data I provided to the model in blue.
The orange line is the result of calling model.predict(x_verify) where x_verify is of the same shape as x_train.

[image: actual y data (blue) vs. model predictions (orange)]

I also calculated the mean absolute percentage error (MAPE) between my prediction and the actual data, and it came out to around 4%, which is not bad given that I only trained for 40 epochs. But this result is still not helpful at all, because as you can see in the graph above, the curves do not match at all.
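
For reference, the MAPE comparison can be computed along these lines (a small sketch, assuming y_verify holds the true target values):

import numpy as np

y_pred = model.predict(x_verify)                              # shape (57, 1)
mape = np.mean(np.abs((y_verify - y_pred) / y_verify)) * 100  # in percent
print(f"MAPE: {mape:.2f}%")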

Question:

What is going on here?

Am I using an incorrect loss function?

Why does it seem like the model tries to predict a single value for all samples rather than a different value for each sample, as it is supposed to?

Ideally the prediction should be the y data which I provided so the curves should look the same (more or less).

Do you have any ideas?

Thanks! 🙂

Asked By: Luca Tatas


Answers:

From the notebook it seems you are not scaling your data. You should normalize or standardize your data before training your model.

https://machinelearningmastery.com/how-to-improve-neural-network-stability-and-modeling-performance-with-data-scaling/

You can add a Normalization layer in Keras: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Normalization
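
For example, adapting a Normalization layer on the training data and applying it to the inputs could look roughly like this (a sketch, assuming x_train and x_verify have the shapes described in the question):

import tensorflow as tf

# Learn the mean and variance of the training data ...
norm = tf.keras.layers.Normalization(axis=-1)
norm.adapt(x_train)               # x_train: shape (samples, time_steps, 1)

# ... then apply the same scaling to everything that goes into the model
x_train_scaled = norm(x_train)
x_verify_scaled = norm(x_verify)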

Answered By: Muhammad Pathan

After some back and forth in the comments, I’ll give my best answers to your questions:

What is going on here?

Very complex (too many layers deep) model with very little data, trained for too few epochs on non-normalized data (credit to Muhammad in his answer). The biggest issue, as far as I can tell, is the number of training epochs.

Am I using an incorrect loss function?

MSE is an appropriate loss function for a regression task.

Why does it seem like the model tries to predict a single value for all samples rather than a different value for each sample, as it is supposed to? Ideally the prediction should be the y data which I provided so the curves should look the same (more or less). Do you have any ideas?

Too few training epochs is the biggest contributor, as far as I can tell.

Based on the Colab notebook that Luca shared:

30 Epochs, no normalization

Way off target, with flat predictions (though I can’t reproduce predictions as flat as the ones Luca posted)


30 Epochs, with normalization

Worse off.


2000(!) epochs, no normalization

Okay, now the predictions are at least in the ballpark.


2000 epochs, with normalization

And now the model seems to be starting to figure things out, like we’d hope it should. Granted, this is training on the 11 samples that were cobbled together in the notebook, so it’s naturally going to overfit. We’re just happy to see it learn something.


2000 epochs, normalization, different loss

Never be afraid to try out different losses, as some may be better suited than others. Not knowing the domain of this task, I’m just trying out mean_absolute_error instead of mean_squared_error.
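
Swapping the loss is a one-line change at compile time, for example:

model.compile(loss="mean_absolute_error", optimizer="adam")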


Caution! Don’t compare loss values between different losses. They’re not on the same scale.

2000 epochs, normalization, larger learning rate

Okay, so it’s taking a long time to learn. Can I nudge it along a little faster? Sure, up the learning rate of the optimizer, and it’ll get you to where you’re going faster. Here, we up it by a factor of 5.

model.compile(loss="mse", optimizer=tf.keras.optimizers.Adam(learning_rate=0.005))

You could even employ a learning rate scheduler that starts big and slowly diminishes it over the course of epochs.

def scheduler(epoch, lr):
    # Keep the initial learning rate for the first 400 epochs,
    # then decay it exponentially.
    if epoch < 400:
        return lr
    else:
        return lr * tf.math.exp(-0.01)

lrs = tf.keras.callbacks.LearningRateScheduler(scheduler)
history = model.fit(x=x_train, y=y_train, epochs=1000, callbacks=[lrs])

Hope this all helps!

Answered By: Bryan Dannowitz

I think it would help to change the loss to Huber loss and perhaps switch the optimizer to SGD. Because of the small dataset, first try to find the best learning rate with a learning rate schedule callback, and normalize or standardize the data before training the model.
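
A minimal sketch of that suggestion might look like this (the learning-rate values and the sweep schedule are assumptions, not taken from the answer):

import tensorflow as tf

model.compile(loss=tf.keras.losses.Huber(),
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3))

# Sweep the learning rate upward over the epochs to find the range where the loss drops fastest
lr_finder = tf.keras.callbacks.LearningRateScheduler(lambda epoch: 1e-5 * 10 ** (epoch / 20))
history = model.fit(x_train, y_train, epochs=100, callbacks=[lr_finder])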

Answered By: Soroush Mirzaei

I just wanted to post a quick update.
First of all, this is my current result:
[image: updated prediction results]

I am absolutely happy that I was finally able to achieve what I wanted to, at least to some extent.

There were some steps I had to take to achieve this result:

  • Normalization
  • Training for 500-1000 epochs
  • Most importantly: Reducing the number of time steps to 1000

In the end my thought of "the more data, the better" was a huge misconception. I was not able to achieve such results with 10000 time steps per sample AT ALL. So I’m glad that I just gave 1000 a shot.
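
For illustration, one simple way to reduce the time steps is to downsample each sample, e.g. by keeping every 10th value (just one possible approach, not necessarily the one used here):

# x_train originally has shape (57, 10000, 1); keep every 10th time step
x_train_small = x_train[:, ::10, :]   # -> (57, 1000, 1)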

Thank you all very much for your answers!
I will try to further improve my model with your suggestions 🙂

Answered By: Luca Tatas