LSTM predicts mean value, how to solve this?

Question:

EDIT:

Thank you all for your input; I'm not sure if the case is resolved, but it seems so.

In my former data preparation function I shuffled the training sequences, which resulted in the LSTM predicting an average.
While browsing the internet I found, by accident, that other people do not shuffle their data.

I'm not sure whether not shuffling the data is OK – it seems strange to me, and I couldn't find a definitive answer on this topic – but when I tried it, the LSTM in fact did well on the test dataset:
[plot: test-set predictions after removing the shuffling]

Can someone please elaborate on why shuffling the data cripples the model?
Or is not shuffling the data just as bad for an LSTM as it is for other models?
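
For context, here is a minimal sketch (toy shapes, not my real data) of the kind of paired shuffling I mean: sklearn reorders whole window/target pairs together, while the time order inside each window stays untouched.

import numpy as np
from sklearn.utils import shuffle

# Toy data: 100 windows of 200 time steps x 14 features, one target per window.
X = np.random.rand(100, 200, 14).astype('float32')
Y = np.random.rand(100, 1).astype('float32')

# Passing the arrays as separate arguments applies the same permutation to both,
# so each window keeps its matching target.
X_shuffled, Y_shuffled = shuffle(X, Y, random_state=0)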

I am trying to make an LSTM predict the next value of an indicator, but it predicts the mean.

Data:
(Note: the data preparation function is at the bottom of the post so the post itself is more readable.)
I have around 25 000 rows of data, each with 14 feature columns.
So my main array is 25 000 x 14.
When I prepare my data I create sequences with the shape [number of sequences, samples per sequence, features], and from them six sets of data:

  1. X_train, Y_train
  2. X_valid, Y_valid
  3. X_test, Y_test

where each Y is the one-step-ahead value of the feature I am trying to predict.
Note:
All datasets are scaled with MinMaxScaler to the range (-1, 1), hence some of the data is below zero.
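
A minimal sketch of this windowing step (toy data standing in for my real array; target column 5 matches the data preparation function at the bottom):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

def make_windows(data, window, target_col):
    """Slice a (time, features) array into (n_windows, window, features)
    plus the one-step-ahead target taken from target_col."""
    n_windows = data.shape[0] - window  # the last window has no next value
    X = np.stack([data[i:i + window] for i in range(n_windows)])
    Y = data[window:window + n_windows, target_col].reshape(-1, 1)
    return X, Y

raw = np.random.rand(25_000, 14).astype('float32')  # stand-in for the real array
scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(raw)
X, Y = make_windows(scaled, window=200, target_col=5)
# X.shape == (24800, 200, 14), Y.shape == (24800, 1)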

The value I am trying to predict behaves in the following manner (the previous values are inside the X datasets):
[plot: the target feature over time]

Example of the data sample:
(Because of the different value levels, I've plotted some series on a separate chart):

[plot: example data sample]

The Problem:

The problem is that no matter how many neurons or layers I use, or which activation functions, it always predicts the mean value of the target. When the network reaches a loss of around 0.078, the loss stops decreasing, and if I wait longer and give it more epochs at the same learning rate the loss sometimes skyrockets to NaN or 10^30.
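
As a side note, a common way to keep a loss from suddenly jumping to NaN is gradient clipping on the optimizer, e.g. via the clipnorm argument of Keras optimizers:

from tensorflow import keras

# Rescale gradients whose global norm exceeds 1.0 before each update step,
# which often prevents sudden jumps of the loss to NaN or huge values.
optimizer = keras.optimizers.Adam(learning_rate=0.001, clipnorm=1.0)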

Here is my Model:

X_train, Y_train, X_valid, Y_valid, X_test, Y_test, scaler = prepare_datasets_lstm(dataset=dataset, samples=200)

optimizer = keras.optimizers.Adam(learning_rate=0.001)
initializer = keras.initializers.he_normal

model = keras.models.Sequential()
model.add(keras.layers.LSTM(64, activation='relu', input_shape=(200, 14), return_sequences=True))
model.add(keras.layers.LSTM(64, activation='relu', return_sequences=True))

model.add(keras.layers.LSTM(3, kernel_regularizer='l2', bias_regularizer='l2', return_sequences=False))
model.add(keras.layers.Activation('sigmoid'))

model.compile(loss='mse', optimizer=optimizer)

history = model.fit(X_train, Y_train, epochs=10, validation_data=(X_valid, Y_valid))

plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.legend()
plt.xlabel('Epochs')
plt.ylabel('loss function value')
plt.grid()
plt.show()

prediction = model.predict(X_test)

The possible solution

While simply increasing the number of neurons and layers didn't help, I found a post on the Cross Validated Stack Exchange forum:
https://stats.stackexchange.com/questions/261704/training-a-neural-network-for-regression-always-predicts-the-mean
where I read two important things. I won't repeat them in full here, but you can go and check out these answers:

  1. Go to @mhdadk's answer and check it out.
  2. Go to Bob's answer and check it out.

So the conclusion is that maybe my neural network is not complex enough, even with 1000 neurons in two layers. It would certainly be interesting to check out a network with 10 000 neurons and see if it works, but the problem is that I would have to run it on a Google Cloud VM, where it would probably compute for a month since I have a limit of 8 CPUs per VM.

The first question:

Is it even worth trying to build a neural net with 10k-50k neurons? I have no idea whether it would bring positive results, and if not I would waste 500-1000 USD or more, plus a month or more of time.
What do you think?

The second question:

If predicting the raw value seems undoable, could the neural network work as a classifier instead, i.e. predict whether the next value will be below, between, or above certain thresholds? Or would it also 'predict the mean' there, i.e. output the most frequent class for all predictions?
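
To make the idea concrete, here is a minimal sketch of that reframing (the thresholds -0.1 and 0.1 are made up for illustration): bin the scaled target into three classes and end the network with a softmax.

import numpy as np
from tensorflow import keras

def to_classes(y, low=-0.1, high=0.1):
    # 0 = below `low`, 1 = between the thresholds, 2 = above `high`
    return np.digitize(y.ravel(), bins=[low, high]).astype('int32')

clf = keras.models.Sequential([
    keras.layers.Input(shape=(200, 14)),
    keras.layers.LSTM(64),
    keras.layers.Dense(3, activation='softmax'),  # one probability per class
])
clf.compile(loss='sparse_categorical_crossentropy',
            optimizer=keras.optimizers.Adam(1e-3),
            metrics=['accuracy'])
# clf.fit(X_train, to_classes(Y_train), validation_data=(X_valid, to_classes(Y_valid)))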

The third question:

Could it be that I am feeding the neural net too much data, and that limiting the data to 5 000 or 10 000 entries would help?

The fourth question:

Do you have any other ideas that might help with the prediction?

Thank you all for your time reading this and thank you for your help in advance 🙂

As I wrote above, here is the data preparation function:

import math

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle


def prepare_datasets_lstm(dataset : pd.DataFrame, samples : int):
    main_data_df = dataset.copy()

    main_data_df = main_data_df.dropna(how='any').copy()

    main_data_df = main_data_df[main_data_df.columns[~main_data_df.columns.isin(['timestamp', 'datetime'])]].copy()

    main_data_np = main_data_df.copy().to_numpy(dtype='float32')


    scaler = StandardScaler()

    signal_data = main_data_np[:, 5]

    main_data_scaled = scaler.fit_transform(main_data_np.copy())
    joblib.dump(scaler, 'lstm_scaler.save')


    samples_val = samples
    sequences_val = (main_data_scaled.shape[0] - samples_val) - 1
    columns_val = main_data_scaled.shape[1]
    # seqeunces = np.empty((number of sequences, samples per sequence, columns))
    seqeunces = np.empty((sequences_val + 1, samples_val, columns_val))
    # etiquets (labels) = np.empty((number of sequences - 1, 1 element per sequence, number of predicted values))
    etiquets = np.empty((sequences_val, 1, 1))
    for i in range(sequences_val + 1):
        for j in range(samples_val):
            for k in range(columns_val):
                seqeunces[i, j, k] = main_data_scaled[i + j, k]

    # label for sequence i: the raw (unscaled) signal value at index i
    for i in range(sequences_val):
        etiquets[i, 0, 0] = signal_data[i]  # alternatively: seqeunces[i + 1, 0, 5]  # CCI


    seqeunces = seqeunces[:-1, :, :].copy()

    shape_x = main_data_scaled.shape[0]
    train_len = math.floor(0.7 * shape_x)
    valid_len = math.floor((shape_x - train_len) * 0.5) + train_len
    train_dataset = seqeunces[:train_len, :, :].copy()
    train_etiquets = etiquets[:train_len, :, :].copy()
    valid_dataset = seqeunces[train_len : valid_len, :, :].copy()
    valid_etiquets = etiquets[train_len : valid_len, :, :].copy()
    test_dataset = seqeunces[valid_len:, :, :].copy()
    test_etiquets = etiquets[valid_len:, :, :].copy()

    # sklearn's shuffle takes the arrays as separate arguments and applies the
    # same permutation to both (this is the shuffling step that, per the edit
    # above, was later removed)
    train_dataset_shuffled, train_etiquets_shuffled = shuffle(train_dataset, train_etiquets, random_state=0)
    valid_dataset_shuffled, valid_etiquets_shuffled = shuffle(valid_dataset, valid_etiquets, random_state=0)


    X_train =  train_dataset_shuffled.copy()
    Y_train = train_etiquets_shuffled.copy()
    X_valid = valid_dataset_shuffled.copy()
    Y_valid = valid_etiquets_shuffled.copy()
    X_test = test_dataset.copy()
    Y_test = test_etiquets.copy()


    return X_train, Y_train, X_valid, Y_valid, X_test, Y_test, scaler
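
For comparison, here is a minimal sketch of equivalent windowing with TensorFlow's built-in tf.keras.utils.timeseries_dataset_from_array (toy data standing in for main_data_scaled), which avoids the nested Python loops:

import numpy as np
import tensorflow as tf

data = np.random.rand(25_000, 14).astype('float32')  # stand-in for main_data_scaled
targets = data[200:, 5]                               # one-step-ahead value of column 5

ds = tf.keras.utils.timeseries_dataset_from_array(
    data=data[:-1],        # drop the last row so windows and targets line up
    targets=targets,
    sequence_length=200,
    batch_size=32,
    shuffle=False,         # keep chronological order
)
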
Asked By: Jakub Szurlej


Answers:

Your best bet is probably to create smaller models first, e.g. simple dense networks with very few neurons (< 50), and see how good they get; iterate over different learning rates, a lot.
Adding complexity rarely helps when developing a model from scratch.
Once you have a simple working model, adding complexity is easy, but to see what works it's best to start small.
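
For example, a minimal sketch of such a small baseline swept over a few learning rates, assuming the X_train/Y_train arrays from the question:

from tensorflow import keras

# Tiny baseline: flatten each (200, 14) window and regress with < 50 units.
for lr in [1e-2, 1e-3, 1e-4]:
    baseline = keras.models.Sequential([
        keras.layers.Input(shape=(200, 14)),
        keras.layers.Flatten(),
        keras.layers.Dense(32, activation='relu'),
        keras.layers.Dense(1),  # linear output for a regression target
    ])
    baseline.compile(loss='mse', optimizer=keras.optimizers.Adam(learning_rate=lr))
    hist = baseline.fit(X_train, Y_train.reshape(-1, 1), epochs=10,
                        validation_data=(X_valid, Y_valid.reshape(-1, 1)), verbose=0)
    print(lr, min(hist.history['val_loss']))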

Answered By: Anton

Q: I am trying to make an LSTM predict the next value of an indicator, but it predicts the mean.

A: LSTM layers, like other layer types, predict values from the changes in the inputs within their scope. See my example below; it uses Dense layers, but the idea is the same for an LSTM (I just wrote and tested it in a minute). The question is about predicting sequences rather than a mean output value, and my example maps 10-element input sequences to output sequences.

Q 1: Is it even worth trying to build a neural net with 10k-50k neurons?

A: Possibly, but since the objective is to work with historical data, you can train the model offline with your optimizers and then work with real-time feedback.

Q 2: If predicting the raw value seems undoable, could the neural network work as a classifier instead, i.e. predict whether the next value will be below, between, or above certain thresholds? Or would it also 'predict the mean' there, i.e. output the most frequent class for all predictions?

A: Possibly, using input/output scaling or value ranges; it depends on the dataset, the scope, and the significance applied.

Q 3: Could it be that I am feeding the neural net too much data, and that limiting the data to 5 000 or 10 000 entries would help?

A: Possibly; it makes training faster and can give more reliable results, but the entire history is worth studying for patterns that can repeat.

Q 4: Do you have any other ideas that might help with the prediction?

A: A better model, reformulating the math problem, better input data, data significance, and further research.

Sample: the 'Ice Cream Sundae' song lyrics; easy as ice cream, but fashionable results.

import os
from os.path import exists

import tensorflow as tf
import tensorflow_text as tft

import matplotlib.pyplot as plt

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
None
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
config = tf.config.experimental.set_memory_growth(physical_devices[0], True)
print(physical_devices)
print(config)

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Variables
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
input_word = tf.constant(" 'Cause it's easy as an ice cream sundae Slipping outta your hand into the dirt Easy as an ice cream sundae Every dancer gets a little hurt Easy as an ice cream sundae Slipping outta your hand into the dirt Easy as an ice cream sundae Every dancer gets a little hurt Easy as an ice cream sundae Oh, easy as an ice cream sundae ")
dataset = tf.data.Dataset.from_tensors( tf.strings.bytes_split(input_word) )

window_size = 6
vocab = [ "a", "b", "c", "d", "e", "f", "g", "h", "I", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "_" ]
layer = tf.keras.layers.StringLookup(vocabulary=vocab)
ZeroPadding1D = tf.keras.layers.ZeroPadding1D(padding=(2))

list_output = [ ]
list_label = [ ]

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Class and Functions
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
class MyDenseLayer(tf.keras.layers.Layer):
    
    def __init__(self, num_outputs):
        super(MyDenseLayer, self).__init__()
        self.num_outputs = num_outputs
        
    def build(self, input_shape):

        self.kernel = self.add_weight("kernel",
        shape=[int(10),
        self.num_outputs])

    def call(self, inputs):
        return tf.matmul(inputs, self.kernel)

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Datasets
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
dataset = dataset.map(lambda x: tft.sliding_window(x, width=window_size, axis=0)).flat_map(tf.data.Dataset.from_tensor_slices)


for sample in dataset:
    inputs_vocab = tf.constant( tf.cast( layer( sample ), dtype=tf.float32 ), shape=(1, 6, 1) )
    result = tf.constant( ZeroPadding1D( inputs_vocab ), shape=(10, 1) )

    list_output.append( [ int( result.numpy()[0] ), int( result.numpy()[1] ), int( result.numpy()[2] ), int( result.numpy()[3] ), int( result.numpy()[4] ),
        int( result.numpy()[5] ), int( result.numpy()[6] ), int( result.numpy()[7] ), int( result.numpy()[8] ), int( result.numpy()[9] ) ] ) 
        
    list_label.append(
        [ int( result.numpy()[0] ), int( result.numpy()[1] ), int( result.numpy()[2] ), int( result.numpy()[3] ), int( result.numpy()[4] ),
        int( result.numpy()[5] ), int( result.numpy()[6] ), int( result.numpy()[7] ), int( result.numpy()[8] ), int( result.numpy()[9] ) ]
    )
        

print( list_label )

start = 0
limit = 322
X = tf.range(start, limit, delta=1, dtype=tf.int32, name='range')
    
fig = plt.figure(1) #identifies the figure 
plt.title("Word and Time", fontsize='16')   #title
plt.plot( X, list_output )  
plt.show()

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Training
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
layer_dense = MyDenseLayer(10)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1, 10)),
    layer_dense,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(192, activation='relu'),
    tf.keras.layers.Dense(10),
])

model.summary()

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Callback
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
class custom_callback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs={}):
        if( logs['accuracy'] >= 0.80 ):
            self.model.stop_training = True
    
custom_callback = custom_callback()
 
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Optimizer
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
optimizer = tf.keras.optimizers.Nadam(
    learning_rate=0.00001, beta_1=0.9, beta_2=0.999, epsilon=1e-07,
    name='Nadam'
)

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Loss Fn
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""                               
lossfn = tf.keras.losses.MeanSquaredError(
    reduction=tf.keras.losses.Reduction.AUTO,
    name='mean_squared_error'
)

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Model Summary
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
model.compile(optimizer=optimizer, loss=lossfn, metrics=['accuracy'])

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Datasets
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
dataset = tf.data.Dataset.from_tensor_slices(( tf.constant(list_output, shape=(322, 1, 1, 10)), tf.constant(list_label, shape=(322, 1, 1, 10) )))

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Training
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
history = model.fit( dataset, batch_size=100, epochs=500, callbacks=[custom_callback] )

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Prediction
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
temp = tf.constant( model.predict( tf.constant(list_output, shape=(322, 1, 10)) ), shape=(322, 10) )


fig = plt.figure(2) #identifies the figure 
plt.title("Word and Time", fontsize='16')   #title
plt.plot( X, temp ) 
plt.show()

Output:

Epoch 38/500
322/322 [==============================] - 2s 7ms/step - loss: 11.6526 - accuracy: 0.7671
Epoch 39/500
322/322 [==============================] - 2s 7ms/step - loss: 11.1693 - accuracy: 0.7795
Epoch 40/500
322/322 [==============================] - 2s 7ms/step - loss: 10.7022 - accuracy: 0.7919
Epoch 41/500
322/322 [==============================] - 2s 6ms/step - loss: 10.2527 - accuracy: 0.8106
11/11 [==============================] - 0s 2ms/step

[plots: "Word and Time" figures for the encoded input and the model prediction]

Answered By: Jirayu Kaewprateep