In numpy, most computationally efficient way to find the array with shortest non-zero sequence in array of arrays

Question:

Say that I have an array of arrays

import numpy as np 

z = np.array(
    [
     [1, 1, 0, 0, 0, 0],
     [1, 1, 1, 1, 1, 0],
     [1, 1, 1, 0, 0, 0],
     [1, 1, 1, 1, 1, 1],
    ]
)

Where 1s start on the left side of each array, and 0s on the right side if any. For many applications, this is how arrays are padded so that each array is of the same length in an array of arrays.

How would I get the shortest sequence of non-zeros for such an array.

In this case, the shortest sequence is the first array, which has a length of 2.

The obvious answer is to iterate over each array and find the index of the first zero, but I feel that there’s probably a method that takes more advantage of numpy’s c processing.

Asked By: SantoshGupta7

||

Answers:

Use np.count_nonzero + np.min:

res = np.count_nonzero(z, axis=1).min()
print(res)

Output

2

The function count_nonzero returns an array like:

[2 5 3 6]

then simply find the minimum value.

If you want the index of the row, use np.argmin instead.

Answered By: Dani Mesejo

If you want to know which sub-array has the least zeros then you could use:

z_sums = z.sum(axis = 1)
z_least = np.argmin(z_sums)
Answered By: user19077881

Benchmark with a 5000×5000 array:

 74.3 ms  Dani
 33.8 ms  user19077881
  2.6 ms  Kelly1
  1.4 ms  Kelly2

My Kelly1 is an O(m+n) saddleback search from top-right to bottom-left:

def Kelly1(z):
    m, n = z.shape
    j = n - 1
    for i in range(m):
        while not z[i, j]:
            j -= 1
            if j < 0:
                return 0
    return j + 1

(Michael Szczesny said it can trivially be made ~150x faster (if I remember correctly) by using Numba. I’m not equipped to test that myself, though.)

My Kelly2 is an O(m log n) horizontal binary search, using NumPy to check whether a column is full of non-zeros:

def Kelly2(z):
    m, n = z.shape
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if z[:, mid].all():
            lo = mid + 1
        else:
            hi = mid
    return lo

(Could be shorter by using bisect with a key, but I don’t have Python 3.10 to test right now.)

Note: Dani and user19077881 return different results: The smallest number of non-zeros in any row, or the row with the fewest non-zeros. I followed Dani’s lead, as that’s the accepted answer. It doesn’t really matter, as you can compute one result from the other very quickly (by finding the index of the first zero in the column or row, respectively).

Full benchmark code (Try it online!):

import numpy as np
from timeit import timeit
import random

m, n = 5000, 5000

def genz():
    lo = random.randrange(n*5//100, n//3)
    return np.array(
        [
            [1]*ones + [0]*(n-ones)
            for ones in random.choices(range(lo, n+1), k=m)
        ]
    )

def Dani(z):
    return np.count_nonzero(z, axis=1).min()

def user19077881(z):
    z_sums = z.sum(axis = 1)
    z_least = np.argmin(z_sums)
    return z_least

def Kelly1(z):
    m, n = z.shape
    j = n - 1
    for i in range(m):
        while not z[i, j]:
            j -= 1
            if j < 0:
                return 0
    return j + 1

def Kelly2(z):
    m, n = z.shape
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if z[:, mid].all():
            lo = mid + 1
        else:
            hi = mid
    return lo

funcs = Dani, user19077881, Kelly1, Kelly2

for _ in range(3):
    z = genz()
    for f in funcs:
        t = timeit(lambda: f(z), number=1)
        print('%5.1f ms ' % (t * 1e3), f.__name__)
    print()
Answered By: Kelly Bundy
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.