np.random.choice with a big probabilities array

Question:

I know that we can use a probability array for the choice function, but my question is how it works for big arrays. Let’s say I want 1,000 random numbers between 0 and 65535. How can I define the probability array so that the numbers less than 1000 are drawn with total probability p=0.4 and the rest with p=0.6?

I tried to pass the range of numbers to the choice function, but apparently, it doesn’t work like that.

Asked By: Vahid Heidaripour


Answers:

From the docs, each element of the argument p gives the probability for the corresponding element in a.

Since p and a need to have the same size, create a p of the same size as a:

a = np.arange(65536)
n_elem = len(a)

p = np.zeros_like(a, dtype=float)

Now, find all the elements of a less than 1000, and set p for those indices to 0.4 divided by the number of elements less than 1000. For this case, you can hardcode that calculation, since you know which elements of an arange are less than 1000:

p[:1000] = 0.4 / 1000
p[1000:] = 0.6 / 64536  # 65536 - 1000 = 64536 elements are >= 1000

For the general case where a is not derived from an arange, you could do:

lt1k = a < 1000
n_lt1k = lt1k.sum()

p[lt1k] = 0.4 / n_lt1k
p[~lt1k] = 0.6 / (n_elem - n_lt1k)

Note that p must sum to 1:

assert np.allclose(p.sum(), 1.0)

Now use a and p in choice:

selection = np.random.choice(a, size=(1000,), p=p)

To verify that the probability of selecting a value < 1000 is 40%, we can check what fraction of the selection is less than 1000:

print((selection < 1000).sum() / len(selection)) # should print a number close to 0.4

I compared runtimes between the approach Sam suggested in their answer and mine. Results are plotted below for splits = np.array([0, N//2, N]) with increasing N. While using random.choice directly is faster for max(splits) - min(splits) < ~5k, Sam’s approach beats mine handily for larger inputs.

[Plot: time per function call (s) vs. max(splits) - min(splits), log-log axes, comparing mixture, choices, and choices_hc]

My timing code is below if you’re interested:

import timeit
import numpy as np
from matplotlib import pyplot as plt

def time_funcs(funcs, sizes, arg_gen, N=20):
    times = np.zeros((len(sizes), len(funcs)))
    gdict = globals().copy()
    for i, s in enumerate(sizes):
        args = arg_gen(s)
        print(args)
        for j, f in enumerate(funcs):
            gdict.update(locals())
            try:
                times[i, j] = timeit.timeit("f(*args)", globals=gdict, number=N) / N
                print(f"{i}/{len(sizes)}, {j}/{len(funcs)}, {times[i, j]}")
            except ValueError:
                print(f"ERROR in {f.__name__}({args})")

    return times

def plot_times(times, funcs):
    fig, ax = plt.subplots()
    for j, f in enumerate(funcs):
        ax.plot(sizes, times[:, j], label=f.__name__)  # note: sizes is read from module scope

    ax.set_xlabel("Array size")
    ax.set_ylabel("Time per function call (s)")
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.legend()
    ax.grid()
    fig.tight_layout()
    return fig, ax

#%%
def arg_gen(n):
    return [np.array([0, n//2, n]), np.array([0.4, 0.6]), n//2]

#%%
def mixture(splits, probs, n):
    rng = np.random.default_rng()
    # draw weighted mixture components
    s = rng.choice(2, n, p=probs)
    # draw uniform values according to component
    return rng.integers(splits[s], splits[s+1])

def choices(splits, probs, n):
    a = np.arange(splits[0], splits[-1])
    n_elem = len(a)
    p = np.zeros_like(a, dtype=float)
    lt1k = a < splits[1]
    n_lt1k = lt1k.sum()
    
    p[lt1k] = probs[0] / n_lt1k
    p[~lt1k] = probs[1] / (n_elem - n_lt1k)
    
    return np.random.choice(a, size=(n,), p=p)

def choices_hc(splits, probs, n):
    assert splits[0] == 0
    a = np.arange(splits[-1])
    p = np.zeros_like(a, dtype=float)
    p[:splits[1]] = probs[0] / splits[1]
    p[splits[1]:] = probs[1] / (splits[2] - splits[1])
    return np.random.choice(a, size=(n,), p=p)
    
#%% 
if __name__ == "__main__":
    #%% Set up sim
    # sizes = [5, 10, 50, 100, 500, 1000, 5000, 10_000, 50_000, 100_000]
    sizes = [5, 10, 50, 100, 500, 1000, 5000, 10_000, 50_000, 100_000, 1_000_000, 5_000_000, 10_000_000]
    funcs = [mixture, choices, choices_hc]
    
    
    #%% Run timing
    time_fcalls = time_funcs(funcs, sizes, arg_gen)
    
    fig, ax = plot_times(time_fcalls, funcs)
    ax.set_xlabel("max(splits) - min(splits)")

    plt.show()
Answered By: Pranav Hosangadi

An alternative would be to treat this as a mixture of two distributions: one that draws uniformly from {0..999} with probability = 0.4, and another that draws uniformly from {1000..65535} with probability = 0.6.

Using choice to pick the mixture component makes sense, but I’d use something else to draw the values themselves: when probabilities are passed to choice, it does O(len(p)) work on every call to transform them. Generator.integers should be more efficient, as it can sample the uniform values directly.
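To illustrate that per-call overhead, here is a rough micro-benchmark sketch (not from the original answer; absolute timings will vary by machine):

import timeit
import numpy as np

rng = np.random.default_rng()
a = np.arange(65536)
p = np.full(len(a), 1 / len(a))  # uniform weights, just to force the p code path

# choice has to process all 65536 probabilities on every call...
t_choice = timeit.timeit(lambda: rng.choice(a, 1000, p=p), number=100) / 100
# ...while integers samples the 1000 values directly
t_integers = timeit.timeit(lambda: rng.integers(0, 65536, 1000), number=100) / 100
print(f"choice with p: {t_choice:.2e} s/call, integers: {t_integers:.2e} s/call")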

Putting this together, I’d suggest using something like:

import numpy as np

rng = np.random.default_rng()

n = 1000
splits = np.array([0, 1000, 65536])

# draw weighted mixture components
s = rng.choice(2, n, p=[0.4, 0.6])
# draw uniform values according to component
result = rng.integers(splits[s], splits[s+1])

You can verify this is drawing from the correct distribution by evaluating np.mean(result < 1000) and checking that it’s "close" to 0.4. The variance of that estimate is approximately 0.4*0.6 / n, so for n=1000, values in [0.37, 0.43] should be seen about 95% of the time.
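For example (a minimal check along those lines):

frac = np.mean(result < 1000)
print(frac)  # should fall in [0.37, 0.43] in roughly 95% of runs (two sigma around 0.4)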

This method should remain fast as max(splits) - min(splits) grows, while Pranav’s solution of using choice directly will slow down.
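The same idea extends to more than two ranges. A sketch of a hypothetical generalization (mixture_k, splits, and probs are illustrative names, not from either answer):

import numpy as np

def mixture_k(splits, probs, n, rng=None):
    # splits: k+1 sorted boundaries; bin i covers splits[i] .. splits[i+1]-1
    # probs: k weights summing to 1
    rng = np.random.default_rng() if rng is None else rng
    s = rng.choice(len(probs), n, p=probs)         # pick a bin for each sample
    return rng.integers(splits[s], splits[s + 1])  # uniform draw within each bin

# e.g. 40% from 0..999, 35% from 1000..9999, 25% from 10000..65535
draws = mixture_k(np.array([0, 1000, 10000, 65536]), [0.4, 0.35, 0.25], 1000)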

Answered By: Sam Mason