Counting consecutive positive values in Python/pandas array
Question:
I’m trying to count consecutive up days in equity return data; so if a positive day is 1 and a negative is 0, a list y=[0,0,1,1,1,0,0,1,0,1,1] should return z=[0,0,1,2,3,0,0,1,0,1,2].
I’ve come to a solution which has few lines of code, but is very slow:
import pandas
y = pandas.Series([0,0,1,1,1,0,0,1,0,1,1])
def f(x):
    return reduce(lambda a, b: (a + b) * b, x)  # Python 2; in Python 3, reduce is in functools
z = pandas.expanding_apply(y, f)
I’m guessing I’m looping through the whole list y too many times. Is there a nice Pythonic way of achieving what I want while only going through the data once? I could write a loop myself, but I’m wondering if there’s a better way.
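For reference, the fold in the snippet above does compute the right values once its stray parenthesis is removed; a minimal sketch on a plain list, assuming the intended folding step was (a + b) * b:

```python
from functools import reduce

y = [0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]

# Fold each prefix: a zero resets the count, a one extends it.
# Re-scanning every prefix makes this O(n^2) overall, which is
# why the expanding-apply formulation is slow.
z = [reduce(lambda a, b: (a + b) * b, y[:i + 1]) for i in range(len(y))]
print(z)  # [0, 0, 1, 2, 3, 0, 0, 1, 0, 1, 2]
```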
Answers:
Why the obsession with the ultra-Pythonic way of doing things? Readability + efficiency trumps “leet hackerz style.”
I’d just do it like so:
a = [0,0,1,1,1,0,0,1,0,1,1]
b = [0,0,0,0,0,0,0,0,0,0,0]

for i in range(len(a)):
    if a[i] == 1:
        b[i] = b[i-1] + 1
    else:
        b[i] = 0
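The loop above can be wrapped into a small reusable function (a minimal sketch; the name count_up_days is mine, not from the answer):

```python
def count_up_days(a):
    """Running count of consecutive 1s, resetting to 0 at each 0."""
    b = [0] * len(a)
    for i in range(len(a)):
        # At i == 0, b[i-1] is b[-1], which is still 0, so no special case needed.
        b[i] = b[i - 1] + 1 if a[i] == 1 else 0
    return b

print(count_up_days([0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]))
# [0, 0, 1, 2, 3, 0, 0, 1, 0, 1, 2]
```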
>>> y = pandas.Series([0,0,1,1,1,0,0,1,0,1,1])
The following may seem a little magical, but it actually uses some common idioms: since pandas doesn’t yet have nice native support for a contiguous groupby, you often find yourself needing something like this.
>>> y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)
0 0
1 0
2 1
3 2
4 3
5 0
6 0
7 1
8 0
9 1
10 2
dtype: int64
Some explanation: first, we compare y against a shifted version of itself to find where the contiguous groups begin:
>>> y != y.shift()
0 True
1 False
2 True
3 False
4 False
5 True
6 False
7 True
8 True
9 True
10 False
dtype: bool
Then (since False == 0 and True == 1) we can apply a cumulative sum to get a number for the groups:
>>> (y != y.shift()).cumsum()
0 1
1 1
2 2
3 2
4 2
5 3
6 3
7 4
8 5
9 6
10 6
dtype: int32
We can use groupby and cumcount to get an integer counting up within each group:
>>> y.groupby((y != y.shift()).cumsum()).cumcount()
0 0
1 1
2 0
3 1
4 2
5 0
6 1
7 0
8 0
9 0
10 1
dtype: int64
Add one:
>>> y.groupby((y != y.shift()).cumsum()).cumcount() + 1
0 1
1 2
2 1
3 2
4 3
5 1
6 2
7 1
8 1
9 1
10 2
dtype: int64
And finally zero the values where we had zero to begin with:
>>> y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)
0 0
1 0
2 1
3 2
4 3
5 0
6 0
7 1
8 0
9 1
10 2
dtype: int64
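For reuse, the whole chain can be packaged into a small helper (a sketch; the function name consecutive_ones is mine):

```python
import pandas as pd

def consecutive_ones(y: pd.Series) -> pd.Series:
    """Count consecutive 1s in y, resetting the count at each 0."""
    blocks = (y != y.shift()).cumsum()            # label each contiguous run
    return y * (y.groupby(blocks).cumcount() + 1)  # count within run, zero out the 0-runs

y = pd.Series([0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1])
print(consecutive_ones(y).tolist())  # [0, 0, 1, 2, 3, 0, 0, 1, 0, 1, 2]
```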
Keeping things simple, using one array, one loop, and one conditional.
a = [0,0,1,1,1,0,0,1,0,1,1]

for i in range(1, len(a)):
    if a[i] == 1:
        a[i] += a[i - 1]
If something is clear, it is “Pythonic”. Frankly, I cannot even make your original solution work. And if it does work, I am curious whether it is faster than a loop. Did you compare?
Now, since we’ve started discussing efficiency, here are some insights.
Loops in Python are inherently slow, no matter what you do. Of course, if you are using pandas, you are also using numpy underneath, with all the performance advantages. Just don’t destroy them by looping. This is not to mention that Python lists take a lot more memory than you may think; potentially much more than 8 bytes * length, since every integer may be wrapped in a separate object, placed in a separate area of memory, and pointed to from the list.
Vectorization provided by numpy should be sufficient IF you can find some way to express this function without looping. In fact, I wonder if there is some way to represent it using expressions such as A+B*C. If you can construct this function out of LAPACK routines, you could even potentially beat ordinary C++ code compiled with optimizations.
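One way to express it without a Python-level loop does exist (my sketch, not from the answer): take a cumulative sum, then at every zero subtract the running maximum of the cumsum recorded at zero positions, which resets the count.

```python
import numpy as np

a = np.array([0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1])

cs = a.cumsum()
# At each zero, record how many ones we have seen so far...
reset = np.where(a == 0, cs, 0)
# ...and subtract the most recent such checkpoint everywhere.
z = cs - np.maximum.accumulate(reset)
print(z)  # [0 0 1 2 3 0 0 1 0 1 2]
```

Every step is a vectorized numpy operation, so the whole computation runs in C with a handful of passes over the data.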
You can also use one of the compiled approaches to speed up your loops. See a solution with Numba on numpy arrays below. Another option is PyPy, though you probably can’t properly combine it with pandas.
In [140]: import pandas as pd
In [141]: import numpy as np
In [143]: a = np.random.randint(2, size=1000000)
In [144]: L = list(a)   # plain-list copy for the pure-Python version
# Try the simple approach
In [147]: def simple(L):
     ...:     for i in range(len(L)):
     ...:         if L[i] == 1:
     ...:             L[i] += L[i-1]
In [148]: %time simple(L)
CPU times: user 255 ms, sys: 20.8 ms, total: 275 ms
Wall time: 248 ms
# Just-In-Time compilation
In [149]: from numba import jit
     ...: @jit
     ...: def faster(z):
     ...:     prev = 0
     ...:     for i in range(len(z)):
     ...:         cur = z[i]
     ...:         if cur == 0:
     ...:             prev = 0
     ...:         else:
     ...:             prev = prev + cur
     ...:         z[i] = prev
In [151]: %time faster(a)
CPU times: user 51.9 ms, sys: 1.12 ms, total: 53 ms
Wall time: 51.9 ms
In [159]: list(L)==list(a)
Out[159]: True
In fact, most of the time in the second example above was spent on just-in-time compilation of faster. To see the steady-state speed, time it again on fresh copies (remember to copy, as the function changes the array in place):
In [37]: b = a.copy(); c = a.copy()
In [38]: %time faster(b)
CPU times: user 55.1 ms, sys: 1.56 ms, total: 56.7 ms
Wall time: 56.3 ms
In [39]: %time faster(c)
CPU times: user 10.8 ms, sys: 42 µs, total: 10.9 ms
Wall time: 10.9 ms
So for subsequent calls we get a roughly 25× speed-up compared to the simple version. I suggest reading High Performance Python if you want to know more.
A similar approach to the answer from @DSM, with fewer steps (here s is the input series):
s.groupby(s.ne(s.shift()).cumsum()).cumsum()
Output:
0 0
1 0
2 1
3 2
4 3
5 0
6 0
7 1
8 0
9 1
10 2
dtype: int64
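A self-contained version of this one-liner, with s defined as the same series as y above:

```python
import pandas as pd

s = pd.Series([0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1])
# Cumulative sum within each run of equal values: runs of 0s stay 0,
# and within a run of 1s the cumsum is exactly the running count.
z = s.groupby(s.ne(s.shift()).cumsum()).cumsum()
print(z.tolist())  # [0, 0, 1, 2, 3, 0, 0, 1, 0, 1, 2]
```

Note this shortcut relies on the data being 0/1: cumsum only equals the running count because each up day contributes exactly 1.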