Encoder-Decoder LSTM model gives 'nan' loss and predictions


I am trying to create a basic encoder-decoder model for training a chatbot. X contains the questions (human dialogue) and Y contains the bot answers. I have padded the sequences to the maximum input and output sentence lengths, so X.shape = (2363, 242, 1) and Y.shape = (2363, 144, 1). During training, however, the loss is 'nan' for every epoch, and the predictions are arrays in which every value is 'nan'. I have tried using the 'rmsprop' optimizer instead of 'adam'. I cannot use the 'categorical_crossentropy' loss function because the output is a sequence of word indices rather than one-hot encoded. What exactly is wrong with my code?


model = Sequential()
model.add(LSTM(units=64, activation='relu', input_shape=(X.shape[1], 1)))
model.add(LSTM(units=64, activation='relu', return_sequences=True))

model.compile(optimizer='adam', loss='mean_squared_error')

hist = model.fit(X, Y, epochs=20, batch_size=64, verbose=2)

Data Preparation

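import string

import numpy as np
# Import path varies by Keras version; newer releases expose keras.utils.pad_sequences
from tensorflow.keras.preprocessing.sequence import pad_sequences

def remove_punctuation(s):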
    s = s.translate(str.maketrans('','',string.punctuation))
    s = s.encode('ascii', 'ignore').decode('ascii')
    return s

def prepare_data(fname):
    word2idx = {'PAD': 0}
    curr_idx = 1
    sents = list()
    for line in open(fname):
        line = line.strip()
        if line:
            tokens = remove_punctuation(line.lower()).split()
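            tmp = []
            for t in tokens:
                if t not in word2idx:
                    word2idx[t] = curr_idx
                    curr_idx += 1
                # Record the index of every token, new or already seen
                tmp.append(word2idx[t])
            # Keep the encoded sentence so it can be padded below
            sents.append(tmp)
    sents = np.array(pad_sequences(sents, padding='post'))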
    return sents, word2idx

human = 'rdany-conversations/human_text.txt'
robot = 'rdany-conversations/robot_text.txt'

X, input_vocab = prepare_data(human)
Y, output_vocab = prepare_data(robot)

X = X.reshape((X.shape[0], X.shape[1], 1))
Y = Y.reshape((Y.shape[0], Y.shape[1], 1))
Asked By: Hrishikesh Bawane



First of all, check that your input does not contain any NaNs. If the input is clean, the cause may be exploding gradients. In that case, standardize your inputs (min-max or z-score scaling), try a smaller learning rate, clip the gradients, or try a different weight initialization scheme.
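
The first three suggestions could look roughly like the sketch below. It assumes the inputs are integer word indices, as in the question, so min-max scaling reduces to dividing by the largest index. The helper names (check_finite, scale_indices, make_optimizer), the toy array, the learning rate of 1e-4, and the clipnorm of 1.0 are illustrative choices, not values taken from the question.

```python
import numpy as np

def check_finite(name, arr):
    """Raise if the array contains NaN or infinite values."""
    arr = np.asarray(arr, dtype="float64")
    bad = int((~np.isfinite(arr)).sum())
    if bad:
        raise ValueError(f"{name} contains {bad} non-finite values")

def scale_indices(arr, vocab_size):
    """Min-max scale word indices from [0, vocab_size - 1] into [0, 1]."""
    return np.asarray(arr, dtype="float32") / max(vocab_size - 1, 1)

def make_optimizer(learning_rate=1e-4, clipnorm=1.0):
    """Adam with a smaller learning rate and gradient-norm clipping."""
    # Imported lazily so the NumPy helpers above work without TensorFlow.
    from tensorflow import keras
    return keras.optimizers.Adam(learning_rate=learning_rate, clipnorm=clipnorm)

# Toy stand-in for the padded index array from the question (0 is 'PAD').
X_demo = np.array([[1, 2, 3, 0], [4, 5, 0, 0]]).reshape(2, 4, 1)
check_finite("X_demo", X_demo)
X_scaled = scale_indices(X_demo, vocab_size=6)
```

With the question's data this would be applied as scale_indices(X, len(input_vocab)) and scale_indices(Y, len(output_vocab)), followed by model.compile(optimizer=make_optimizer(), loss='mean_squared_error'). If Y is scaled, the predictions have to be multiplied back and rounded before they can be mapped to words.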

Answered By: Tinu