Is there a nice way to check if numpy array elements are within a range?

Question:

I want to write:

assert np.all(0 < a < 2)

where a is a numpy array, but it doesn’t work. What’s a nice way to write this?

Asked By: Neil G


Answers:

You could use numpy.logical_and:

>>> a = np.repeat(1, 10)
>>> np.logical_and(a > 0, a < 2).all()
True

or using the & operator:

>>> ((0 < a) & (a < 2)).all()
True
Answered By: Ashwini Chaudhary

You could achieve this within NumPy with either:

import numpy as np


def between_all_and(arr, a, b):
    return np.all((arr > a) & (arr < b))

or:

import numpy as np


def between_and_all(arr, a, b):
    return np.all(arr > a) and np.all(arr < b)

(or, equivalently, by calling np.ndarray.all() instead of np.all()).
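For illustration, the method-call variant of the first function could look like this (the function name below is just for this sketch):

import numpy as np


def between_all_and_method(arr, a, b):
    # Same logic as between_all_and(), but calling the ndarray.all() method
    # on the intermediate boolean array instead of np.all().
    return ((arr > a) & (arr < b)).all()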

Note that np.all() could be replaced by the built-in all(), which may be faster for small inputs but is much slower on large ones.
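A sketch of that replacement (the name is illustrative; .ravel() keeps the iteration one-dimensional):

def between_all_and_py(arr, a, b):
    # Built-in all() iterating over the flattened boolean array: acceptable for
    # small inputs, but element-by-element Python iteration is slow for large ones.
    return all(((arr > a) & (arr < b)).ravel())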

While they give the same results, they both have sub-optimal short-circuiting properties:

  • between_all_and() ("all of and") computes both the arr > a and arr < b temporary arrays in full before np.all() gets any chance to short-circuit.
  • between_and_all() ("and of all") can skip the arr < b check entirely (thanks to the Python and), but only after the whole arr > a comparison has been evaluated.

On randomly distributed arrays, this means that the two may have very different timings.

Alternatively, one can use a loop-based implementation accelerated with Numba:

import numpy as np
import numba as nb


@nb.njit
def between_nb(arr, a, b):
    # Work on a flat view so arrays of any dimensionality are supported.
    arr = arr.ravel()
    for x in arr:
        # Stop at the first element outside the open interval (a, b).
        if x <= a or x >= b:
            return False
    return True

This has much better short-circuiting properties, and does not create potentially large temporary arrays.
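A quick usage sketch (the array and bounds below are chosen purely for illustration, and it assumes the imports and definition above; the first call also triggers Numba's JIT compilation):

arr = np.random.random(10_000)      # uniform samples in [0, 1)
print(between_nb(arr, -1.0, 2.0))   # True: every sample lies in (-1, 2)
print(between_nb(arr, 2.0, 3.0))    # False: returns at the very first element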

One can produce benchmarks on batches (of size m) of arrays (of size n) of random numbers uniformly distributed in the [0, 1) range, to get an idea of which approach is faster and by how much.


Benchmarks

Assuming arrays of uniformly distributed random numbers in the [0, 1) range, checking against different bounds produces cases with different amounts of short-circuiting:

  • an "average case" for a range like (0.01, 0.99)
  • a "worst case" (no short-circuiting possible) for a range like (-1.0, 2.0)
  • a "best case" (potentially immediate short-circuiting) for a range like (2.0, 3.0)

The benchmarks are produced with:

import pandas as pd
import matplotlib.pyplot as plt


def benchmark(
    funcs,
    args=None,
    kws=None,
    ii=range(4, 24),
    m=2 ** 15,
    is_equal=np.allclose,
    seed=0,
    unit="ms",
    verbose=True
):
    labels = [func.__name__ for func in funcs]
    units = {"s": 0, "ms": 3, "µs": 6, "ns": 9}
    args = tuple(args) if args else ()
    kws = dict(kws) if kws else {}
    assert unit in units
    np.random.seed(seed)
    timings = {}
    for i in ii:
        n = 2 ** i       # array size for this step
        k = 1 + m // n   # arrays per batch, so each batch holds roughly m elements
        if verbose:
            print(f"i={i}, n={n}, m={m}, k={k}")
        arrs = np.random.random((k, n))
        # The first function provides the reference results for the correctness check.
        base = np.array([funcs[0](arr, *args, **kws) for arr in arrs])
        timings[n] = []
        for func in funcs:
            res = np.array([func(arr, *args, **kws) for arr in arrs])
            is_good = is_equal(base, res)
            # %timeit line magic: this is meant to run inside IPython/Jupyter.
            timed = %timeit -n 8 -r 8 -q -o [func(arr, *args, **kws) for arr in arrs]
            timing = timed.best / k
            timings[n].append(timing if is_good else None)
            if verbose:
                print(
                    f"{func.__name__:>24}"
                    f"  {is_good!s:5}"
                    f"  {timing * (10 ** units[unit]):10.3f} {unit}"
                    f"  {timings[n][0] / timing:5.1f}x")
    return timings, labels


def plot(timings, labels, title=None, xlabel="Input Size / #", unit="ms"):
    n_rows = 1
    n_cols = 3
    fig, axs = plt.subplots(n_rows, n_cols, figsize=(8 * n_cols, 6 * n_rows), squeeze=False)
    units = {"s": 0, "ms": 3, "µs": 6, "ns": 9}
    df = pd.DataFrame(data=timings, index=labels).transpose()
    
    base = df[[labels[0]]].to_numpy()
    (df * 10 ** units[unit]).plot(marker="o", xlabel=xlabel, ylabel=f"Best timing / {unit}", ax=axs[0, 0])
    (df / base * 100).plot(marker='o', xlabel=xlabel, ylabel='Relative speed / %', logx=True, ax=axs[0, 1])
    (base / df).plot(marker='o', xlabel=xlabel, ylabel='Speed Gain / x', ax=axs[0, 2])

    if title:
        fig.suptitle(title)
    fig.patch.set_facecolor('white')

to be called as follows:

funcs = between_all_and, between_and_all, between_nb

avg_timings, avg_labels = benchmark(funcs, args=(0.01, 0.99), unit="µs", verbose=False)
wrs_timings, wrs_labels = benchmark(funcs, args=(-1.0, 2.0), unit="µs", verbose=False)
bst_timings, bst_labels = benchmark(funcs, args=(2.0, 3.0), unit="µs", verbose=False)
plot(avg_timings, avg_labels, "Average Case", unit="µs")
plot(wrs_timings, wrs_labels, "Worst Case", unit="µs")
plot(bst_timings, bst_labels, "Best Case", unit="µs")

to produce:

[benchmark plots: bm_avg (average case), bm_wrs (worst case), bm_bst (best case)]

These plots give an idea of which approach is faster in which regime.

Typically, the Numba-based approach is not only the most memory-efficient (it creates no temporary arrays), but also the fastest.

Answered By: norok2