Neural Network LSTM input shape from dataframe

Question:

I am trying to implement an LSTM with Keras.

I know that LSTMs in Keras require a 3D tensor with shape (nb_samples, timesteps, input_dim) as input. However, I am not entirely sure what the input should look like in my case, as I have just one sample of T observations for each input, not multiple samples, i.e. (nb_samples=1, timesteps=T, input_dim=N). Is it better to split each of my inputs into samples of length T/M? T is around a few million observations for me, so how long should each sample be in that case, i.e., how would I choose M?

Also, am I right in that this tensor should look something like:

[[[a_11, a_12, ..., a_1M], [a_21, a_22, ..., a_2M], ..., [a_N1, a_N2, ..., a_NM]], 
 [[b_11, b_12, ..., b_1M], [b_21, b_22, ..., b_2M], ..., [b_N1, b_N2, ..., b_NM]], 
 ..., 
 [[x_11, x_12, ..., x_1M], [x_21, x_22, ..., x_2M], ..., [x_N1, x_N2, ..., x_NM]]]

where M and N are defined as before and x corresponds to the last sample that I would have obtained from splitting as discussed above?

Finally, given a pandas dataframe with T observations in each column, and N columns, one for each input, how can I create such an input to feed to Keras?

Asked By: dreamer


Answers:

Tensor shape

You’re right that Keras is expecting a 3D tensor for an LSTM neural network, but I think the piece you are missing is that Keras expects that each observation can have multiple dimensions.

For example, in Keras I have used word vectors to represent documents for natural language processing. Each word in the document is represented by an n-dimensional numerical vector (so if n = 2 the word ‘cat’ would be represented by something like [0.31, 0.65]). To represent a single document, the word vectors are lined up in sequence (e.g. ‘The cat sat.’ = [[0.12, 0.99], [0.31, 0.65], [0.94, 0.04]]). A document would be a single sample in a Keras LSTM.
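To make the shapes concrete, here is a minimal NumPy sketch of that one-document example. The variable names are my own and the vector values are just the illustrative ones from the paragraph above.

```python
import numpy as np

# One document = one sample: a sequence of 3 word vectors, each with n = 2 dimensions
the_cat_sat = [[0.12, 0.99], [0.31, 0.65], [0.94, 0.04]]

# Wrapping the document in another list adds the samples axis
batch = np.array([the_cat_sat])

nb_samples, timesteps, input_dim = batch.shape
print(batch.shape)  # (1, 3, 2)
```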

This is analogous to your time series observations. A document is like a time series, and a word is like a single observation in it; the only difference is that each of your observations has just n = 1 dimension.

Because of that, I think your tensor should be something like [[[a1], [a2], ... , [aT]], [[b1], [b2], ..., [bT]], ..., [[x1], [x2], ..., [xT]]], where a, b, ..., x are your samples (nb_samples of them), timesteps = T, and input_dim = 1, because each of your observations is only one number.
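As a rough sketch of how one long single-valued series could be cut into that shape: the series length T = 12 and window length M = 4 below are made-up values for illustration, and non-overlapping windows are only one possible way to split the series, not a recommendation for how to choose M.

```python
import numpy as np

T, M = 12, 4                             # hypothetical series length and window length
series = np.arange(T, dtype="float32")   # stand-in for one column of observations

nb_samples = T // M                      # number of complete windows
usable = series[: nb_samples * M]        # drop any leftover observations at the end

# (nb_samples, timesteps, input_dim) with input_dim = 1
X = usable.reshape(nb_samples, M, 1)
print(X.shape)      # (3, 4, 1)
print(X[1, :, 0])   # [4. 5. 6. 7.]
```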

Batch size

Batch size should be set to maximize throughput without exceeding the memory capacity of your machine, per this Cross Validated post. As far as I know, the number of samples in your input does not need to be a multiple of your batch size, either when training the model or when making predictions from it.
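A small illustration of why the sample count does not need to divide evenly: with plain slicing, the last batch is simply smaller. The helper name iter_batches is hypothetical and only mimics how a data set is typically cut into batches; it is not a Keras function.

```python
def iter_batches(n_samples, batch_size):
    """Yield (start, stop) index pairs covering n_samples in order."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

# 11 samples with a batch size of 7: one full batch and one partial batch
batch_sizes = [stop - start for start, stop in iter_batches(11, 7)]
print(batch_sizes)  # [7, 4]
```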

Examples

If you’re looking for sample code, the Keras GitHub repository has a number of examples using LSTM and other network types that take sequenced input.

Answered By: Andrew

Below is an example that sets up time series data to train an LSTM. The model output is nonsense as I only set it up to demonstrate how to build the model.

import pandas as pd
import numpy as np
# Get some time series data
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/timeseries.csv")
df.head()

Time series dataframe:

         Date      A       B       C      D      E      F      G
0  2008-03-18  24.68  164.93  114.73  26.27  19.21  28.87  63.44
1  2008-03-19  24.18  164.89  114.75  26.22  19.07  27.76  59.98
2  2008-03-20  23.99  164.63  115.04  25.78  19.01  27.04  59.61
3  2008-03-25  24.14  163.92  114.85  27.41  19.61  27.84  59.41
4  2008-03-26  24.44  163.45  114.84  26.86  19.53  28.02  60.09

You can put the inputs for each row into a single vector and then use the pandas .cumsum() function to build the sequence for the time series:

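# Assumed column split (not stated in the original post): six input columns
# and one output column, which matches the shapes printed further down
input_cols = ['A', 'B', 'C', 'D', 'E', 'F']
output_cols = ['G']
# Put your inputs into a single list
df['single_input_vector'] = df[input_cols].apply(tuple, axis=1).apply(list)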
# Double-encapsulate list so that you can sum it in the next step and keep time steps as separate elements
df['single_input_vector'] = df.single_input_vector.apply(lambda x: [list(x)])
# Use .cumsum() to include previous row vectors in the current row list of vectors
df['cumulative_input_vectors'] = df.single_input_vector.cumsum()
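To see what the cumulative sum does here without pandas: adding Python lists concatenates them, so a running sum of one-element lists grows the sequence by one time step per row. This sketch uses itertools.accumulate on made-up two-dimensional rows; it illustrates the idea and is not how pandas implements .cumsum().

```python
from itertools import accumulate
from operator import add

# Each row is wrapped in an outer list, like single_input_vector above
rows = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]

# list + list concatenates, so the running "sum" is the history so far
cumulative = list(accumulate(rows, add))

for seq in cumulative:
    print(len(seq), seq)
# 1 [[1.0, 2.0]]
# 2 [[1.0, 2.0], [3.0, 4.0]]
# 3 [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
```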

The output can be set up in a similar way, but it will be a single vector instead of a sequence:

# If your output is multi-dimensional, you need to capture those dimensions in one object
# If your output is a single dimension, this step may be unnecessary
df['output_vector'] = df[output_cols].apply(tuple, axis=1).apply(list)

The input sequences have to be the same length to run them through the model, so you need to pad them to be the max length of your cumulative vectors:

# Pad your sequences so they are the same length
from keras.preprocessing.sequence import pad_sequences

max_sequence_length = df.cumulative_input_vectors.apply(len).max()
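# Save it as a list
# dtype is set explicitly because pad_sequences defaults to int32, which would truncate the float values
padded_sequences = pad_sequences(df.cumulative_input_vectors.tolist(), maxlen=max_sequence_length, dtype='float32').tolist()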
df['padded_input_vectors'] = pd.Series(padded_sequences).apply(np.asarray)
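For intuition about what the padding step produces, here is a plain-Python sketch that pads at the front with zero vectors, which is what pad_sequences does by default (padding='pre', value=0). The helper pre_pad is hypothetical and exists only for illustration.

```python
def pre_pad(sequences, max_len, dim):
    """Left-pad each sequence of dim-sized vectors with zero vectors."""
    padded = []
    for seq in sequences:
        n_missing = max_len - len(seq)
        zeros = [[0.0] * dim for _ in range(n_missing)]
        padded.append(zeros + [list(v) for v in seq])
    return padded

sequences = [
    [[1.0, 2.0]],
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
]
padded = pre_pad(sequences, max_len=3, dim=2)
print(padded[0])  # [[0.0, 0.0], [0.0, 0.0], [1.0, 2.0]]
```

The real observations end up at the end of each padded sequence, so the most recent time step is always in the last position.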

Training data can be pulled from the dataframe and put into numpy arrays. Note that the input data that comes out of the dataframe will not make a 3D array. It makes an array of arrays, which is not the same thing.

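You can use vstack and reshape to build a 3D input array. Stacking the per-row arrays vertically keeps each sample's time steps together; hstack followed by the same reshape would interleave time steps from different samples.

# Extract your training data
X_train_init = np.asarray(df.padded_input_vectors)
# Use vstack and reshape to make the inputs a 3D array
X_train = np.vstack(X_train_init).reshape(len(df), max_sequence_length, len(input_cols))
# Each output vector is 1D, so hstack simply concatenates them before the reshape
y_train = np.hstack(np.asarray(df.output_vector)).reshape(len(df), len(output_cols))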

To prove it:

>>> print(X_train_init.shape)
(11,)
>>> print(X_train.shape)
(11, 11, 6)
>>> print(X_train == X_train_init)
False
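The difference between an array of arrays and a true 3D array can be reproduced on a small scale. The sizes below (3 samples, 4 time steps, 2 features) are arbitrary, and np.stack is used here as one way to combine equally shaped samples.

```python
import numpy as np

# Three samples, each a (4 time steps x 2 features) array
samples = [np.full((4, 2), float(i)) for i in range(3)]

# An object array whose elements are 2D arrays, similar to what
# np.asarray returns for a dataframe column of arrays
array_of_arrays = np.empty(len(samples), dtype=object)
for i, sample in enumerate(samples):
    array_of_arrays[i] = sample

# A genuine 3D array with a new leading samples axis
stacked = np.stack(list(array_of_arrays))

print(array_of_arrays.shape)  # (3,)
print(stacked.shape)          # (3, 4, 2)
```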

Once you have training data, you can define the dimensions of your input and output layers.

# Get your input dimensions
# Input length is the length for one input sequence (i.e. the number of rows for your sample)
# Input dim is the number of dimensions in one input vector (i.e. number of input columns)
input_length = X_train.shape[1]
input_dim = X_train.shape[2]
# Output dimensions is the shape of a single output vector
# In this case it's just 1, but it could be more
output_dim = len(y_train[0])

Build the model:

from keras.models import Sequential
from keras.layers import LSTM, Dense

# Build the model
model = Sequential()

# I arbitrarily picked the output dimensions as 4
# input_shape is (timesteps, input_dim); it replaces the older input_dim / input_length keyword pair
model.add(LSTM(4, input_shape=(input_length, input_dim)))
# The max output value is > 1 so relu is used as final activation.
model.add(Dense(output_dim, activation='relu'))

model.compile(loss='mean_squared_error',
              optimizer='sgd',
              metrics=['accuracy'])

Finally you can train the model and save the training log as history:

# Set batch_size to 7 to show that it doesn't have to be a factor or multiple of your sample size
# (the epochs argument was called nb_epoch in Keras 1.x)
history = model.fit(X_train, y_train,
                    batch_size=7, epochs=3,
                    verbose=1)

Output:

Epoch 1/3
11/11 [==============================] - 0s - loss: 3498.5756 - acc: 0.0000e+00     
Epoch 2/3
11/11 [==============================] - 0s - loss: 3498.5755 - acc: 0.0000e+00     
Epoch 3/3
11/11 [==============================] - 0s - loss: 3498.5757 - acc: 0.0000e+00 

That’s it. Use model.predict(X) where X is the same format (other than the number of samples) as X_train in order to make predictions from the model.

Answered By: Andrew