Find the index of n consecutive values greater than zero with the largest sum in a NumPy array (or pandas Series)
Question:
So here is my problem: I have an array like this:
import numpy as np
arr = np.array([0, 0, 1, 8, 10, 20, 26, 32, 37, 52, 0, 0, 46, 42, 30, 19, 8, 2, 0, 0, 0])
In this array I want to find n consecutive values, all greater than zero, with the largest sum. In this example, with n = 5, this would be array([20, 26, 32, 37, 52]) and the index would be 5.
What I tried is of course a loop:
n = 5
max_sum = 0
max_loc = 0
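# + 1 so that the last window (starting at arr.size - n) is checked too
for i in range(arr.size - n + 1):
    window = arr[i:i + n]
    # keep the window only if every value is positive and its sum beats the best so far
    if (window > 0).all() and window.sum() > max_sum:
        max_sum = window.sum()
        max_loc = i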
print(max_loc)
This is fine for a small number of short arrays, but I need to run it on many fairly long arrays.
I was experimenting with numpy so that I would only have to iterate over the groups of non-zero values:
diffs = np.concatenate((np.array([False]), np.diff(arr > 0)))
groups = np.split(arr, np.where(diffs)[0])
for group in groups:
    if group.sum() > 0 and group.size >= n:
        ...
I believe this is nice, but not the right direction. I am looking for a simpler and faster numpy / pandas solution that really uses the power of these packages.
Answers:
You can use sliding_window_view:
from numpy.lib.stride_tricks import sliding_window_view
N = 5
win = sliding_window_view(arr, N)
idx = ((win.sum(axis=1)) * ((win>0).all(axis=1))).argmax()
print(idx, arr[idx:idx+N])
# Output
5 [20 26 32 37 52]
Answer greatly enhanced by chrslg to save memory and keep win as a view.
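One thing to watch with the one-liner above: if no window of length N is entirely positive, the product is all zeros and argmax returns 0, which looks the same as a real match at index 0. Here is a minimal sketch of how that case could be made explicit. The function name best_window_start and the choice to return None are my own assumptions, not part of the original answer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def best_window_start(arr, n):
    """Start index of the all-positive window of length n with the largest sum.

    Returns None when no such window exists (hypothetical convention).
    """
    arr = np.asarray(arr)
    if n <= 0 or arr.size < n:
        return None
    win = sliding_window_view(arr, n)      # shape (arr.size - n + 1, n), a view
    valid = (win > 0).all(axis=1)          # True where every value in the window is > 0
    if not valid.any():
        return None
    # Valid windows have a strictly positive sum, so invalid ones can be set to 0.
    sums = np.where(valid, win.sum(axis=1), 0)
    return int(sums.argmax())


arr = np.array([0, 0, 1, 8, 10, 20, 26, 32, 37, 52, 0, 0, 46, 42, 30, 19, 8, 2, 0, 0, 0])
print(best_window_start(arr, 5))                       # 5
print(best_window_start(np.array([0, 1, 0, 2, 0]), 2)) # None
```

Returning None instead of 0 lets the caller tell "no qualifying window" apart from "best window starts at index 0".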
Update
As a bonus, the same idea can be written with a pandas Series:
import pandas as pd
N = 5
idx = pd.Series(arr).where(lambda x: x > 0).rolling(N).sum().shift(-N+1).idxmax()
print(idx, arr[idx:idx+N])
# Output
5 [20 26 32 37 52]
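In case the pandas chain is hard to follow, here is the same computation split into steps. This is only an unrolled sketch of the one-liner above; the intermediate names (masked, window_sums, aligned) are mine.

```python
import numpy as np
import pandas as pd

arr = np.array([0, 0, 1, 8, 10, 20, 26, 32, 37, 52, 0, 0, 46, 42, 30, 19, 8, 2, 0, 0, 0])
N = 5

s = pd.Series(arr)

# 1. Replace every value that is not > 0 with NaN.
masked = s.where(s > 0)

# 2. Rolling sum over N values. With the default min_periods (= N),
#    any window that contains a NaN yields NaN.
window_sums = masked.rolling(N).sum()

# 3. rolling() labels each result by the LAST position of its window;
#    shifting by -(N - 1) relabels it by the FIRST position instead.
aligned = window_sums.shift(-N + 1)

# 4. idxmax skips NaN, so only all-positive windows compete.
idx = aligned.idxmax()
print(idx, arr[idx:idx + N])
```

If no window qualifies, every entry of aligned is NaN, so that case would need separate handling before calling idxmax.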
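Using cross-correlation, numpy.correlate, is a possible, concise and fast solution. Note that it only maximizes the window sum; it does not check that all values in the window are greater than zero, although it gives the expected result for the example array: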
n=5
idx = np.argmax(np.correlate(arr, np.ones(n), 'valid'))
idx, arr[idx:(idx+n)]
Another possible solution:
n, l = 5, arr.size
idx = np.argmax([np.sum(np.roll(arr,-x)[:n]) for x in range(l-n+1)])
idx, arr[idx:(idx+n)]
Output:
(5, array([20, 26, 32, 37, 52]))
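To enforce the greater-than-zero condition with np.correlate as well, one option is to correlate the boolean mask arr > 0 with the same kernel and keep only the windows where the count equals n. This is a sketch under my own assumptions: the function name correlate_positive_start and the None return value are hypothetical, and I have not benchmarked it against the other answers.

```python
import numpy as np


def correlate_positive_start(arr, n):
    """Start index of the largest-sum window of length n whose values are all > 0.

    Returns None when no window qualifies (hypothetical convention).
    """
    arr = np.asarray(arr)
    if n <= 0 or arr.size < n:
        return None
    kernel = np.ones(n, dtype=int)
    sums = np.correlate(arr, kernel, "valid")                         # sum of each window
    positives = np.correlate((arr > 0).astype(int), kernel, "valid")  # count of values > 0 per window
    valid = positives == n
    if not valid.any():
        return None
    # Valid windows have a strictly positive sum, so invalid ones can be set to 0.
    return int(np.argmax(np.where(valid, sums, 0)))


tricky = np.array([0, 100, 0, 1, 1, 1])
print(int(np.argmax(np.correlate(tricky, np.ones(3), "valid"))))  # 1, window [100, 0, 1] contains a zero
print(correlate_positive_start(tricky, 3))                        # 3, window [1, 1, 1]
```

The tricky array shows the difference: the plain sum picks a window that contains a zero, while the masked version only considers windows that are entirely positive.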