Counting consecutive positive values in Python/pandas array

Question:

I’m trying to count consecutive up days in equity return data; so if a positive day is 1 and a negative is 0, a list y=[0,0,1,1,1,0,0,1,0,1,1] should return z=[0,0,1,2,3,0,0,1,0,1,2].

I’ve come to a solution with only a few lines of code, but it is very slow:

import pandas
from functools import reduce  # reduce is no longer a builtin in Python 3

y = pandas.Series([0,0,1,1,1,0,0,1,0,1,1])

def f(x):
    return reduce(lambda a, b: (a + b) * b, x)

z = y.expanding().apply(f, raw=True)  # pandas.expanding_apply was removed in newer pandas

I’m guessing I’m looping through the whole list y too many times. Is there a nice Pythonic way of achieving what I want while only going through the data once? I could write a loop myself but wondering if there’s a better way.
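(For what it’s worth, the recurrence behind the question, z[i] = (z[i-1] + y[i]) * y[i], can be computed in a true single pass with itertools.accumulate. This is a sketch, not part of the original question:)

```python
from itertools import accumulate

y = [0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]

# accumulate carries the running streak; multiplying by the current
# value resets the streak to zero whenever a 0 is encountered
z = list(accumulate(y, lambda streak, v: (streak + v) * v))
print(z)  # [0, 0, 1, 2, 3, 0, 0, 1, 0, 1, 2]
```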

Asked By: alex314159


Answers:

Why the obsession with the ultra-Pythonic way of doing things? Readability plus efficiency trumps “leet hackerz style.”

I’d just do it like so:

a = [0,0,1,1,1,0,0,1,0,1,1]
b = [0] * len(a)

for i in range(len(a)):
    if a[i] == 1:
        b[i] = b[i-1] + 1  # at i == 0 this reads b[-1], which is still 0 at that point
    else:
        b[i] = 0
Answered By: Coding Orange
>>> y = pandas.Series([0,0,1,1,1,0,0,1,0,1,1])

The following may seem a little magical, but actually uses some common idioms: since pandas doesn’t yet have nice native support for a contiguous groupby, you often find yourself needing something like this.

>>> y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)
0     0
1     0
2     1
3     2
4     3
5     0
6     0
7     1
8     0
9     1
10    2
dtype: int64

Some explanation: first, we compare y against a shifted version of itself to find when the contiguous groups begin:

>>> y != y.shift()
0      True
1     False
2      True
3     False
4     False
5      True
6     False
7      True
8      True
9      True
10    False
dtype: bool

Then (since False == 0 and True == 1) we can apply a cumulative sum to get a number for the groups:

>>> (y != y.shift()).cumsum()
0     1
1     1
2     2
3     2
4     2
5     3
6     3
7     4
8     5
9     6
10    6
dtype: int32

We can use groupby and cumcount to get us an integer counting up in each group:

>>> y.groupby((y != y.shift()).cumsum()).cumcount()
0     0
1     1
2     0
3     1
4     2
5     0
6     1
7     0
8     0
9     0
10    1
dtype: int64

Add one:

>>> y.groupby((y != y.shift()).cumsum()).cumcount() + 1
0     1
1     2
2     1
3     2
4     3
5     1
6     2
7     1
8     1
9     1
10    2
dtype: int64

And finally zero the values where we had zero to begin with:

>>> y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)
0     0
1     0
2     1
3     2
4     3
5     0
6     0
7     1
8     0
9     1
10    2
dtype: int64
Answered By: DSM

Keeping things simple, using one array, one loop, and one conditional.

a = [0,0,1,1,1,0,0,1,0,1,1]

for i in range(1, len(a)):
    if a[i] == 1:
        a[i] += a[i - 1]
Answered By: Dan

If something is clear, it is “Pythonic”. Frankly, I cannot even make your original solution work. And if it does work, I am curious whether it is actually faster than a loop. Did you compare?

Now, since we’ve started discussing efficiency, here are some insights.

Loops in Python are inherently slow, no matter what you do. If you are using pandas, you are also using numpy underneath, with all its performance advantages; just don’t throw them away by looping. Note also that Python lists take much more memory than you might expect, potentially far more than 8 bytes * length, since every integer may be wrapped in a separate object, placed in a separate area of memory, and pointed to from the list.

Vectorization provided by numpy should be sufficient IF you can find some way to express this function without looping. In fact, I wonder if there is some way to represent it using expressions such as A+B*C. If you can construct this function out of functions in LAPACK, then you can even potentially beat ordinary C++ code compiled with optimization.
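One such loop-free formulation does exist (a sketch, not from the original answer): take a running cumulative sum, and subtract the value that sum had at the most recent zero. Since the cumulative sum is non-decreasing, np.maximum.accumulate can track that reset point:

```python
import numpy as np

y = np.array([0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1])

c = np.cumsum(y)
# value of the cumulative sum at the most recent zero (0 before any zero is seen)
reset = np.maximum.accumulate(np.where(y == 0, c, 0))
z = c - reset
print(z)  # [0 0 1 2 3 0 0 1 0 1 2]
```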

You can also use one of the compiled approaches to speed up your loops. See a solution with Numba on numpy arrays below. Another option is to use PyPy, though you probably can’t combine it cleanly with pandas.

In [140]: import pandas as pd
In [141]: import numpy as np
In [143]: a=np.random.randint(2,size=1000000)
In [144]: L=list(a)   # plain-list copy for the pure-Python version below

# Try the simple approach
In [147]: def simple(L):
              for i in range(len(L)):
                  if L[i]==1:
                      L[i] += L[i-1]


In [148]: %time simple(L)
CPU times: user 255 ms, sys: 20.8 ms, total: 275 ms
Wall time: 248 ms


# Just-In-Time compilation
In[149]: from numba import jit
@jit          
def faster(z):
    prev=0
    for i in range(len(z)):
        cur=z[i]
        if cur==0:
             prev=0
        else:
             prev=prev+cur
             z[i]=prev

In [151]: %time faster(a)
CPU times: user 51.9 ms, sys: 1.12 ms, total: 53 ms
Wall time: 51.9 ms


In [159]: list(L)==list(a)
Out[159]: True

In fact, most of the time in the second example above was spent on just-in-time compilation. Instead, time subsequent calls on fresh copies (remember to copy, as the function changes the array in place):

b=a.copy()
c=a.copy()
In [38]: %time faster(b)
CPU times: user 55.1 ms, sys: 1.56 ms, total: 56.7 ms
Wall time: 56.3 ms

In [39]: %time faster(c)
CPU times: user 10.8 ms, sys: 42 µs, total: 10.9 ms
Wall time: 10.9 ms

So for subsequent calls we get roughly a 25× speedup compared to the simple version. I suggest you read High Performance Python if you want to know more.

Answered By: Sergey Orshanskiy

A similar approach to the answer from @DSM, with fewer steps (here s is the input series, i.e. y above):

s.groupby(s.ne(s.shift()).cumsum()).cumsum()

Output:

0     0
1     0
2     1
3     2
4     3
5     0
6     0
7     1
8     0
9     1
10    2
dtype: int64
Answered By: Mykola Zotko