Linear regression for time series

Question:

I am pretty new to machine learning and a bit confused, so sorry if this is a trivial question.
I have a very simple time series data set with two columns: Date and Price. I'm predicting the price and want to add some features to my model, such as a moving average over the last 10 days. Suppose I split the data set into training and validation sets 80:20. For the first 80 days I can calculate the moving average. What about my validation set? Should I use the predicted values as input for the moving average? Is there a ready-made implementation of such a solution? I'm using the Python scikit-learn library.

Asked By: mlnoob

Answers:

Interesting question. It seems like you are building an autoregressive-style model, i.e. a model that predicts future values from previous values of the series. When you forecast more than one step ahead, the true prices for those days are not yet available, so you are right that in the validation set you will need to compute the ten-day moving average from your own predictions. As far as I know, there is no built-in functionality in scikit-learn to do this. It should not be too difficult to implement, however. Maybe something like this would work:

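def predict(ten_day):
    # Placeholder (hypothetical): replace with a call to your fitted model
    return ten_day

s = list(range(80))  # stand-in for the 80 known training prices
predictions = []
for i in range(20):
    # Average of the last 10 values; these are gradually replaced by predictions
    ten_day = sum(s[-10:]) / 10
    pred = predict(ten_day)
    predictions.append(pred)
    s.append(pred)  # feed the prediction back in for the next step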

I would still advise you to read up on autoregressive models to get some more insight. You could also have a look at https://stats.stackexchange.com/a/346918 for some information on how to split the data.
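
To make the loop above a bit more concrete, here is a minimal sketch with an actual scikit-learn model. It is only an illustration under stated assumptions: the prices are a synthetic random walk rather than real data, and recursive_forecast is a hypothetical helper name, not something scikit-learn provides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def recursive_forecast(model, history, steps, window=10):
    """Forecast `steps` values, feeding each prediction back into the moving average.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    values = list(history)
    predictions = []
    for _ in range(steps):
        # Moving average over the last `window` values (true prices and/or predictions)
        feature = np.mean(values[-window:])
        pred = float(model.predict(np.array([[feature]]))[0])
        predictions.append(pred)
        values.append(pred)
    return predictions


# Assumption: synthetic random-walk prices stand in for 100 days of real data
rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(0, 1, size=100))
train = prices[:80]

window = 10
# Feature for day t is the mean of the `window` days before t (no look-ahead)
X = np.array([train[t - window:t].mean() for t in range(window, len(train))]).reshape(-1, 1)
y = train[window:]
model = LinearRegression().fit(X, y)

# Forecast the 20 validation days without ever looking at their true prices
predictions = recursive_forecast(model, train, steps=20, window=window)
```

Because every step reuses earlier predictions, errors tend to accumulate over the 20 days, so it is worth comparing the forecasts against prices[80:] afterwards.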

Answered By: Jozef

OK, here is a solution using 250 data points of historical GOOG closing prices. I have explained the code with comments. Please feel free to ask if anything in there is unclear. As you can see, I use pandas, which has a convenience function "rolling" that computes, among other things, rolling means. I split the data set by hand, but it can also be done with e.g. sklearn.model_selection.train_test_split, as long as you pass shuffle=False so that the chronological order is kept (by default it shuffles the rows, which would leak future prices into the training set):

import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np

# Read data from file
df = pd.read_csv("GOOG.csv")

# Calculate 10 day rolling mean and drop first 10 rows because we cannot calculate rolling mean for them
# shift moves the averages one step ahead so day 10 gets moving average of days 0-9, etc...
df["Rolling_10d_close"] = df['Close'].rolling(10).mean().shift(1)
df = df.dropna()

# Split data into training and validation sets
training_last_row = int(len(df) * 0.8)
training_data = df.iloc[:training_last_row]
validation_data = df.iloc[training_last_row:]

# Train model on training set of data
x = training_data["Rolling_10d_close"].to_numpy().reshape(-1, 1)
y = training_data["Close"].to_numpy().reshape(-1, 1)

reg = LinearRegression().fit(x, y)
print(reg.coef_, reg.intercept_)
# prints [[0.95972717]] [4.14010503]

# Test the performance of predictions on the validation data set
# (these are the actual validation features and prices; score returns R^2)
x_val = validation_data["Rolling_10d_close"].to_numpy().reshape(-1, 1)
y_val = validation_data["Close"].to_numpy().reshape(-1, 1)

print(reg.score(x_val, y_val))
# prints 0.02467230502090556
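
If you want to convince yourself that the shifted rolling mean only uses past prices, and that a chronological split keeps the validation rows after the training rows, a small check on made-up data may help. This is just a sketch: the toy prices 0 to 29 are an assumption chosen so that the averages are easy to verify by hand.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy prices 0, 1, ..., 29 so the averages are easy to verify by hand
toy = pd.DataFrame({"Close": [float(i) for i in range(30)]})
toy["Rolling_10d_close"] = toy["Close"].rolling(10).mean().shift(1)

# Row 10 holds the mean of rows 0-9, i.e. only prices from before that day
assert toy.loc[10, "Rolling_10d_close"] == 4.5
# The first 10 rows have no complete 10-day history and are therefore NaN
assert toy["Rolling_10d_close"].isna().sum() == 10

toy = toy.dropna()

# shuffle=False keeps the rows in time order: earliest rows train, latest rows validate
train, val = train_test_split(toy, test_size=0.2, shuffle=False)
assert train.index.max() < val.index.min()
```

Without the shift(1), row 10 would include its own closing price in the average, which would leak the target into the feature.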

Answered By: kakben