How to predict time series in scikit-learn?
Question:
Scikit-learn utilizes a very convenient approach based on fit
and predict
methods. I have time-series data in the format suited for fit
and predict
.
For example I have the following Xs
:
[[1.0, 2.3, 4.5], [6.7, 2.7, 1.2], ..., [3.2, 4.7, 1.1]]
and the corresponding ys
:
[[1.0], [2.3], ..., [7.7]]
These data have the following meaning. The values stored in ys
form a time series. The values in Xs
are corresponding time dependent “factors” that are known to have some influence on the values in ys
(for example: temperature, humidity and atmospheric pressure).
Now, of course, I can use fit(Xs,ys)
. But then I get a model in which future values in ys
depend only on factors and do not dependend on the previous Y
values (at least directly) and this is a limitation of the model. I would like to have a model in which Y_n
depends also on Y_{n-1}
and Y_{n-2}
and so on. For example I might want to use an exponential moving average as a model. What is the most elegant way to do it in scikit-learn
ADDED
As it has been mentioned in the comments, I can extend Xs
by adding ys
. But this way has some limitations. For example, if I add the last 5 values of y
as 5 new columns to X
, the information about time ordering of ys
is lost. For example, there is no indication in X
that values in the 5th column follows value in the 4th column and so on. As a model, I might want to have a linear fit of the last five ys
and use the found linear function to make a prediction. But if I have 5 values in 5 columns it is not so trivial.
ADDED 2
To make my problem even more clear, I would like to give one concrete example. I would like to have a “linear” model in which y_n = c + k1*x1 + k2*x2 + k3*x3 + k4*EMOV_n
, where EMOV_n is just an exponential moving average. How, can I implement this simple model in scikit-learn?
Answers:
This might be what you’re looking for, with regard to the exponentially weighted moving average:
import pandas, numpy
ewma = pandas.stats.moments.ewma
EMOV_n = ewma( ys, com=2 )
Here, com
is a parameter that you can read about here. Then you can combine EMOV_n
to Xs
, using something like:
Xs = numpy.vstack((Xs,EMOV_n))
And then you can look at various linear models, here, and do something like:
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit ( Xs, ys )
print clf.coef_
Best of luck!
According to Wikipedia, EWMA works well with stationary data, but it does not work as expected in the presence of trends, or seasonality. In those cases you should use a second or third order EWMA method, respectively. I decided to look at the pandas ewma
function to see how it handled trends, and this is what I came up with:
import pandas, numpy as np
ewma = pandas.stats.moments.ewma
# make a hat function, and add noise
x = np.linspace(0,1,100)
x = np.hstack((x,x[::-1]))
x += np.random.normal( loc=0, scale=0.1, size=200 )
plot( x, alpha=0.4, label='Raw' )
# take EWMA in both directions with a smaller span term
fwd = ewma( x, span=15 ) # take EWMA in fwd direction
bwd = ewma( x[::-1], span=15 ) # take EWMA in bwd direction
c = np.vstack(( fwd, bwd[::-1] )) # lump fwd and bwd together
c = np.mean( c, axis=0 ) # average
# regular EWMA, with bias against trend
plot( ewma( x, span=20 ), 'b', label='EWMA, span=20' )
# "corrected" (?) EWMA
plot( c, 'r', label='Reversed-Recombined' )
legend(loc=8)
savefig( 'ewma_correction.png', fmt='png', dpi=100 )
As you can see, the EWMA bucks the trend uphill and downhill. We can correct for this (without having to implement a second-order scheme ourselves) by taking the EWMA in both directions and then averaging. I hope your data was stationary!
Scikit-learn utilizes a very convenient approach based on fit
and predict
methods. I have time-series data in the format suited for fit
and predict
.
For example I have the following Xs
:
[[1.0, 2.3, 4.5], [6.7, 2.7, 1.2], ..., [3.2, 4.7, 1.1]]
and the corresponding ys
:
[[1.0], [2.3], ..., [7.7]]
These data have the following meaning. The values stored in ys
form a time series. The values in Xs
are corresponding time dependent “factors” that are known to have some influence on the values in ys
(for example: temperature, humidity and atmospheric pressure).
Now, of course, I can use fit(Xs,ys)
. But then I get a model in which future values in ys
depend only on factors and do not dependend on the previous Y
values (at least directly) and this is a limitation of the model. I would like to have a model in which Y_n
depends also on Y_{n-1}
and Y_{n-2}
and so on. For example I might want to use an exponential moving average as a model. What is the most elegant way to do it in scikit-learn
ADDED
As it has been mentioned in the comments, I can extend Xs
by adding ys
. But this way has some limitations. For example, if I add the last 5 values of y
as 5 new columns to X
, the information about time ordering of ys
is lost. For example, there is no indication in X
that values in the 5th column follows value in the 4th column and so on. As a model, I might want to have a linear fit of the last five ys
and use the found linear function to make a prediction. But if I have 5 values in 5 columns it is not so trivial.
ADDED 2
To make my problem even more clear, I would like to give one concrete example. I would like to have a “linear” model in which y_n = c + k1*x1 + k2*x2 + k3*x3 + k4*EMOV_n
, where EMOV_n is just an exponential moving average. How, can I implement this simple model in scikit-learn?
This might be what you’re looking for, with regard to the exponentially weighted moving average:
import pandas, numpy
ewma = pandas.stats.moments.ewma
EMOV_n = ewma( ys, com=2 )
Here, com
is a parameter that you can read about here. Then you can combine EMOV_n
to Xs
, using something like:
Xs = numpy.vstack((Xs,EMOV_n))
And then you can look at various linear models, here, and do something like:
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit ( Xs, ys )
print clf.coef_
Best of luck!
According to Wikipedia, EWMA works well with stationary data, but it does not work as expected in the presence of trends, or seasonality. In those cases you should use a second or third order EWMA method, respectively. I decided to look at the pandas ewma
function to see how it handled trends, and this is what I came up with:
import pandas, numpy as np
ewma = pandas.stats.moments.ewma
# make a hat function, and add noise
x = np.linspace(0,1,100)
x = np.hstack((x,x[::-1]))
x += np.random.normal( loc=0, scale=0.1, size=200 )
plot( x, alpha=0.4, label='Raw' )
# take EWMA in both directions with a smaller span term
fwd = ewma( x, span=15 ) # take EWMA in fwd direction
bwd = ewma( x[::-1], span=15 ) # take EWMA in bwd direction
c = np.vstack(( fwd, bwd[::-1] )) # lump fwd and bwd together
c = np.mean( c, axis=0 ) # average
# regular EWMA, with bias against trend
plot( ewma( x, span=20 ), 'b', label='EWMA, span=20' )
# "corrected" (?) EWMA
plot( c, 'r', label='Reversed-Recombined' )
legend(loc=8)
savefig( 'ewma_correction.png', fmt='png', dpi=100 )
As you can see, the EWMA bucks the trend uphill and downhill. We can correct for this (without having to implement a second-order scheme ourselves) by taking the EWMA in both directions and then averaging. I hope your data was stationary!