python pandas: how to calculate derivative/gradient
Question:
Given that I have the following two vectors:
In [99]: time_index
Out[99]:
[1484942413,
1484942712,
1484943012,
1484943312,
1484943612,
1484943912,
1484944212,
1484944511,
1484944811,
1484945110]
In [100]: bytes_in
Out[100]:
[1293981210388,
1293981379944,
1293981549960,
1293981720866,
1293981890968,
1293982062261,
1293982227492,
1293982391244,
1293982556526,
1293982722320]
Here bytes_in is a monotonically increasing counter, and time_index is a list of Unix timestamps (epoch seconds).
Objective: I would like to calculate the bitrate.
To do that, I build a Series like this:
In [101]: timeline = pandas.to_datetime(time_index, unit="s")
In [102]: recv = pandas.Series(bytes_in, timeline).resample("300S").mean().ffill().apply(lambda i: i*8)
In [103]: recv
Out[103]:
2017-01-20 20:00:00 10351849683104
2017-01-20 20:05:00 10351851039552
2017-01-20 20:10:00 10351852399680
2017-01-20 20:15:00 10351853766928
2017-01-20 20:20:00 10351855127744
2017-01-20 20:25:00 10351856498088
2017-01-20 20:30:00 10351857819936
2017-01-20 20:35:00 10351859129952
2017-01-20 20:40:00 10351860452208
2017-01-20 20:45:00 10351861778560
Freq: 300S, dtype: int64
Question: Oddly, calculating the gradient manually gives me:
In [104]: (bytes_in[1]-bytes_in[0])*8/300
Out[104]: 4521.493333333333
which is the correct value, while calculating the gradient with pandas gives me:
In [124]: recv.diff()
Out[124]:
2017-01-20 20:00:00 NaN
2017-01-20 20:05:00 1356448.0
2017-01-20 20:10:00 1360128.0
2017-01-20 20:15:00 1367248.0
2017-01-20 20:20:00 1360816.0
2017-01-20 20:25:00 1370344.0
2017-01-20 20:30:00 1321848.0
2017-01-20 20:35:00 1310016.0
2017-01-20 20:40:00 1322256.0
2017-01-20 20:45:00 1326352.0
Freq: 300S, dtype: float64
which is not the same: 1356448.0 differs from 4521.493333333333.
Could you please explain what I am doing wrong?
Answers:
pd.Series.diff() only takes the differences between consecutive values. It does not also divide by the delta of the index.
Dividing by the index delta in seconds gets you the answer:
recv.diff() / recv.index.to_series().diff().dt.total_seconds()
2017-01-20 20:00:00 NaN
2017-01-20 20:05:00 4521.493333
2017-01-20 20:10:00 4533.760000
2017-01-20 20:15:00 4557.493333
2017-01-20 20:20:00 4536.053333
2017-01-20 20:25:00 4567.813333
2017-01-20 20:30:00 4406.160000
2017-01-20 20:35:00 4366.720000
2017-01-20 20:40:00 4407.520000
2017-01-20 20:45:00 4421.173333
Freq: 300S, dtype: float64
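For reference, the steps above can be put together as one self-contained script. This is a minimal sketch using the data from the question; the names bits, dt and bitrate are my own, and "5min" is used here as an equivalent spelling of the "300S" frequency, because some newer pandas releases warn about the uppercase "S" alias.

```python
import pandas as pd

time_index = [1484942413, 1484942712, 1484943012, 1484943312, 1484943612,
              1484943912, 1484944212, 1484944511, 1484944811, 1484945110]
bytes_in = [1293981210388, 1293981379944, 1293981549960, 1293981720866,
            1293981890968, 1293982062261, 1293982227492, 1293982391244,
            1293982556526, 1293982722320]

timeline = pd.to_datetime(time_index, unit="s")

# Snap the samples onto a regular 5-minute grid and convert bytes to bits.
bits = pd.Series(bytes_in, index=timeline).resample("5min").mean().ffill() * 8

# Elapsed seconds between consecutive index entries (NaN for the first row).
dt = bits.index.to_series().diff().dt.total_seconds()

# Finite difference divided by elapsed time gives bits per second.
bitrate = bits.diff() / dt
print(bitrate)
```

The first entry is NaN because there is no earlier sample to subtract from it.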
You could also use numpy.gradient, passing bytes_in and the spacing you expect between samples. This does not reduce the length by one; instead it makes assumptions about the edges.
np.gradient(bytes_in, 300) * 8
array([ 4521.49333333, 4527.62666667, 4545.62666667, 4546.77333333,
4551.93333333, 4486.98666667, 4386.44 , 4387.12 ,
4414.34666667, 4421.17333333])
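Note that the raw timestamps in the question are not exactly 300 seconds apart (some gaps are 299 s). As a sketch, assuming NumPy 1.13 or later, np.gradient also accepts the sample coordinates in place of a constant spacing, so the real gaps can be used without resampling. The variable names t, b and bitrate_raw are my own.

```python
import numpy as np

time_index = [1484942413, 1484942712, 1484943012, 1484943312, 1484943612,
              1484943912, 1484944212, 1484944511, 1484944811, 1484945110]
bytes_in = [1293981210388, 1293981379944, 1293981549960, 1293981720866,
            1293981890968, 1293982062261, 1293982227492, 1293982391244,
            1293982556526, 1293982722320]

# Work in float64; these magnitudes are still represented exactly.
t = np.asarray(time_index, dtype=np.float64)
b = np.asarray(bytes_in, dtype=np.float64)

# Passing the coordinate array makes np.gradient use the actual time gaps.
bitrate_raw = np.gradient(b, t) * 8  # bits per second, same length as the input
print(bitrate_raw)
```

Because the first gap is 299 s rather than 300 s, the first value here differs slightly from the resampled result.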
A simple explanation is that .diff() subtracts the previous entry from each entry (a backward difference), while np.gradient() uses central differences for interior points and one-sided differences at the two ends.
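The difference between the two schemes is easy to see on a tiny example. This is an illustrative sketch on made-up data (the squares 0, 1, 4, 9, 16 with unit spacing), not the data from the question.

```python
import numpy as np
import pandas as pd

y = pd.Series([0.0, 1.0, 4.0, 9.0, 16.0])

# Backward difference: y[i] - y[i-1]; the first entry has no predecessor.
backward = y.diff()
# -> NaN, 1, 3, 5, 7

# Central difference in the interior: (y[i+1] - y[i-1]) / 2,
# one-sided differences at the first and last points.
central = np.gradient(y.to_numpy())
# -> 1, 2, 4, 6, 7

print(backward.to_numpy())
print(central)
```

With uniform spacing, each interior np.gradient value is the average of the two neighbouring .diff() values. The same holds in the answer above: 4527.63 is the average of 4521.49 and 4533.76.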
As there is no built-in derivative method on a pandas Series / DataFrame, you can use https://github.com/scls19fr/pandas-helper-calc.
It provides a new accessor called calc on pandas Series and DataFrames to numerically calculate derivatives and integrals.
So you will be able to simply do
recv.calc.derivative()
It uses diff() under the hood.
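Or, if you'd like the relative rate of change, you can just use df.pct_change().
As a parameter you can pass df.pct_change(n), where n is the lookback period in rows (for a regularly sampled datetime-indexed dataframe, that is n sampling intervals). Note that this returns the fractional change between rows, not a derivative with respect to time.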
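To make clear what pct_change returns, here is a small sketch on made-up values (a series growing by 10% per row). The names s, rel1 and rel2 are my own.

```python
import pandas as pd

s = pd.Series([100.0, 110.0, 121.0])

# Fractional change relative to the previous row: s[i] / s[i-1] - 1
rel1 = s.pct_change()
# -> NaN, ~0.10, ~0.10

# Fractional change relative to the row two steps back: s[i] / s[i-2] - 1
rel2 = s.pct_change(2)
# -> NaN, NaN, ~0.21

print(rel1)
print(rel2)
```

The result is dimensionless, so it does not answer the bitrate question directly; for bits per second you still need to divide a difference by the elapsed time.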
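To label each derivative with the start of its interval rather than the end, change the index of the resulting series:
def derivate(serie):
    # finite difference divided by the elapsed time in seconds; drop the leading NaN
    df1 = (serie.diff() / serie.index.to_series().diff().dt.total_seconds()).dropna()
    # relabel each value with the timestamp at the start of its interval
    df1.index = serie.index[0:-1]
    return df1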