Average on overlapping windows in Python
Question:
I’m trying to compute a moving average but with a set step size between each average. For example, if I was computing the average of a 4 element window every 2 elements:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
This should produce the average of [1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8], [7, 8, 9, 10].
window_avg = [2.5, 4.5, 6.5, 8.5]
My data is such that the ending will be truncated before processing so there is no problem with the length with respect to window size.
I’ve read a bit about how to do moving averages in Python and there seems to be a lot of usage of itertools; however, the iterators go one element at a time and I can’t figure out how to have a step size between each calculation of the average.
(How to calculate moving average in Python 3?)
I have also been able to do this before in MATLAB by creating a matrix of indices which are overlapping and then indexing the data vector and performing a column wise mean (Create matrix by repeatedly overlapping a vector). However, since this vector is rather large (~70 000 elements, window of 450 samples, average every 30 samples), the computation would probably require too much memory.
Any help would be greatly appreciated. I am using Python 2.7.
Answers:
One way to compute the average of a sliding window across a list in Python is to use a list comprehension. You can use
>>> range(0, len(data), 2)
[0, 2, 4, 6, 8]
to get the starting indices of each window, and then numpy
‘s mean
function to take the average of each window. See the demo below:
>>> import numpy as np
>>> window_size = 4
>>> stride = 2
>>> window_avg = [ np.mean(data[i:i+window_size]) for i in range(0, len(data), stride)
if i+window_size <= len(data) ]
>>> window_avg
[2.5, 4.5, 6.5, 8.5]
Note that the list comprehension does have a condition to ensure that it only computes the average of “full windows”, or sublists with exactly window_size
elements.
When run on a dataset of the size discussed in the OP, this method computes on my MBA in a little over 200 ms:
In [5]: window_size = 450
In [6]: data = range(70000)
In [7]: stride = 30
In [8]: timeit [ np.mean(data[i:i+window_size]) for i in range(0, len(data), stride)
if i+window_size <= len(data) ]
1 loops, best of 3: 220 ms per loop
It is about twice as fast on my machine to the itertools
approach presented by @Abhijit:
In [9]: timeit map(np.mean, izip(*(islice(it, i, None, stride) for i, it in enumerate(tee(data, window_size)))))
1 loops, best of 3: 436 ms per loop
The following approach uses itertools at its fullest to create moving average window of size 4
. As then entire expression is a generator which is evaluated when calculating the average, it has a complexity of O(n).
>>> import numpy as np
>>> from itertools import count, tee, izip, islice
>>> map(np.mean, izip(*(islice(it,i,None,2)
for i, it in enumerate(tee(data, 4)))))
[2.5, 4.5, 6.5, 8.5]
Its interesting to note, how individual itertools function works in accord.
itertools.tee
n-plicates an iterator, in this case 4 times
- enumerate creates an enumerator object which yield a tuple of index and element (which is the iterator)
- slice the iterator with stride 2, starting from the index position.
You can use rolling function of Pandas DataFrame,
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
df = pd.DataFrame(data)
>>>
0
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
Using Pandas DataFrame’s rolling function,
df.rolling(4).mean().dropna()[::2]
>>>
0
3 2.5
5 4.5
7 6.5
9 8.5
4 is the window size and 2 in [::2] can be assumed to be step size.
Actually, df.rolling(4).mean().dropna()
shift the window 1-by-1 and by applying index [::2], we pick one after taking two steps.
Alternatively,
If you have Pandas version > 1.5, you can give step size. Note that, center argument must be ‘True’. The solution:
df.rolling(4, step=2, center=True).mean().dropna()
>>> df
0
2 2.5
4 4.5
6 6.5
8 8.5
I’m trying to compute a moving average but with a set step size between each average. For example, if I was computing the average of a 4 element window every 2 elements:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
This should produce the average of [1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8], [7, 8, 9, 10].
window_avg = [2.5, 4.5, 6.5, 8.5]
My data is such that the ending will be truncated before processing so there is no problem with the length with respect to window size.
I’ve read a bit about how to do moving averages in Python and there seems to be a lot of usage of itertools; however, the iterators go one element at a time and I can’t figure out how to have a step size between each calculation of the average.
(How to calculate moving average in Python 3?)
I have also been able to do this before in MATLAB by creating a matrix of indices which are overlapping and then indexing the data vector and performing a column wise mean (Create matrix by repeatedly overlapping a vector). However, since this vector is rather large (~70 000 elements, window of 450 samples, average every 30 samples), the computation would probably require too much memory.
Any help would be greatly appreciated. I am using Python 2.7.
One way to compute the average of a sliding window across a list in Python is to use a list comprehension. You can use
>>> range(0, len(data), 2)
[0, 2, 4, 6, 8]
to get the starting indices of each window, and then numpy
‘s mean
function to take the average of each window. See the demo below:
>>> import numpy as np
>>> window_size = 4
>>> stride = 2
>>> window_avg = [ np.mean(data[i:i+window_size]) for i in range(0, len(data), stride)
if i+window_size <= len(data) ]
>>> window_avg
[2.5, 4.5, 6.5, 8.5]
Note that the list comprehension does have a condition to ensure that it only computes the average of “full windows”, or sublists with exactly window_size
elements.
When run on a dataset of the size discussed in the OP, this method computes on my MBA in a little over 200 ms:
In [5]: window_size = 450
In [6]: data = range(70000)
In [7]: stride = 30
In [8]: timeit [ np.mean(data[i:i+window_size]) for i in range(0, len(data), stride)
if i+window_size <= len(data) ]
1 loops, best of 3: 220 ms per loop
It is about twice as fast on my machine to the itertools
approach presented by @Abhijit:
In [9]: timeit map(np.mean, izip(*(islice(it, i, None, stride) for i, it in enumerate(tee(data, window_size)))))
1 loops, best of 3: 436 ms per loop
The following approach uses itertools at its fullest to create moving average window of size 4
. As then entire expression is a generator which is evaluated when calculating the average, it has a complexity of O(n).
>>> import numpy as np
>>> from itertools import count, tee, izip, islice
>>> map(np.mean, izip(*(islice(it,i,None,2)
for i, it in enumerate(tee(data, 4)))))
[2.5, 4.5, 6.5, 8.5]
Its interesting to note, how individual itertools function works in accord.
itertools.tee
n-plicates an iterator, in this case 4 times- enumerate creates an enumerator object which yield a tuple of index and element (which is the iterator)
- slice the iterator with stride 2, starting from the index position.
You can use rolling function of Pandas DataFrame,
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
df = pd.DataFrame(data)
>>>
0
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
Using Pandas DataFrame’s rolling function,
df.rolling(4).mean().dropna()[::2]
>>>
0
3 2.5
5 4.5
7 6.5
9 8.5
4 is the window size and 2 in [::2] can be assumed to be step size.
Actually, df.rolling(4).mean().dropna()
shift the window 1-by-1 and by applying index [::2], we pick one after taking two steps.
Alternatively,
If you have Pandas version > 1.5, you can give step size. Note that, center argument must be ‘True’. The solution:
df.rolling(4, step=2, center=True).mean().dropna()
>>> df
0
2 2.5
4 4.5
6 6.5
8 8.5