np.random.choice with a big probabilities array
Question:
I know that we can use a probability array with the choice function, but my question is how that works for big arrays. Let’s assume I want 1,000 random numbers between 0 and 65535. How can we define the probability array so that numbers less than 1000 get a total probability of p=0.4 and the rest get p=0.6?
I tried to pass the range of numbers to the choice function, but apparently, it doesn’t work like that.
Answers:
From the docs, each element of the argument `p` gives the probability for the corresponding element in `a`. Since `p` and `a` need to have the same size, create a `p` of the same size as `a`:
a = np.arange(65536)
n_elem = len(a)
p = np.zeros_like(a, dtype=float)
Now, find all the elements of `a` less than 1000, and set `p` at those indices to 0.4 divided by the number of elements less than 1000. For this case, you can hardcode that calculation, since you know which elements of an `arange` are less than 1000:
p[:1000] = 0.4 / 1000
p[1000:] = 0.6 / 64536
For the general case where `a` is not derived from an `arange`, you could do:
lt1k = a < 1000
n_lt1k = lt1k.sum()
p[lt1k] = 0.4 / n_lt1k
p[~lt1k] = 0.6 / (n_elem - n_lt1k)
Note that `p` must sum to 1:
assert np.allclose(p.sum(), 1.0)
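As a quick sanity check of the general-case construction, the same masking works for a hypothetical `a` that is not a plain `arange` (the values below are arbitrary, chosen only for illustration):

```python
import numpy as np

# Hypothetical values, chosen only for illustration -- not an arange
a = np.array([5, 500, 999, 1000, 2000, 40000])
p = np.zeros_like(a, dtype=float)

lt1k = a < 1000
n_lt1k = lt1k.sum()                 # 3 values below 1000
p[lt1k] = 0.4 / n_lt1k              # each gets 0.4/3
p[~lt1k] = 0.6 / (len(a) - n_lt1k)  # each gets 0.6/3

assert np.allclose(p.sum(), 1.0)
print(p)
```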
Now use `a` and `p` in `choice`:
selection = np.random.choice(a, size=(1000,), p=p)
To verify that the probability of selecting a value < 1000 is 40%, we can check how many are less than 1000:
print((selection < 1000).sum() / len(selection)) # should print a number close to 0.4
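Putting the steps above together, a self-contained version might look like this (the fixed seed is my own addition, purely for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, purely for reproducibility

a = np.arange(65536)
p = np.zeros_like(a, dtype=float)
p[:1000] = 0.4 / 1000    # the 40% mass, spread over the values < 1000
p[1000:] = 0.6 / 64536   # the 60% mass, spread over the remaining 64536 values
assert np.allclose(p.sum(), 1.0)

selection = rng.choice(a, size=1000, p=p)
frac_lt1k = (selection < 1000).mean()
print(frac_lt1k)  # close to 0.4
```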
I compared runtimes between the approach Sam suggested in their answer vs. mine. Results are plotted below for `splits = np.array([0, N//2, N])` with increasing `N`. While using `random.choice` directly is faster for `max(splits) - min(splits)` below roughly 5k, Sam’s approach beats mine handily for larger inputs.
My timing code is below if you’re interested:
import timeit
import numpy as np
from matplotlib import pyplot as plt
def time_funcs(funcs, sizes, arg_gen, N=20):
    times = np.zeros((len(sizes), len(funcs)))
    gdict = globals().copy()
    for i, s in enumerate(sizes):
        args = arg_gen(s)
        print(args)
        for j, f in enumerate(funcs):
            gdict.update(locals())
            try:
                times[i, j] = timeit.timeit("f(*args)", globals=gdict, number=N) / N
                print(f"{i}/{len(sizes)}, {j}/{len(funcs)}, {times[i, j]}")
            except ValueError:
                print(f"ERROR in {f.__name__}({args})")
    return times
def plot_times(times, funcs):
    # note: reads `sizes` from module scope
    fig, ax = plt.subplots()
    for j, f in enumerate(funcs):
        ax.plot(sizes, times[:, j], label=f.__name__)
    ax.set_xlabel("Array size")
    ax.set_ylabel("Time per function call (s)")
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.legend()
    ax.grid()
    fig.tight_layout()
    return fig, ax
#%%
def arg_gen(n):
    return [np.array([0, n//2, n]), np.array([0.4, 0.6]), n//2]
#%%
def mixture(splits, probs, n):
    rng = np.random.default_rng()
    # draw weighted mixture components
    s = rng.choice(2, n, p=probs)
    # draw uniform values according to component
    return rng.integers(splits[s], splits[s+1])
def choices(splits, probs, n):
    a = np.arange(splits[0], splits[-1])
    n_elem = len(a)
    p = np.zeros_like(a, dtype=float)
    lt1k = a < splits[1]
    n_lt1k = lt1k.sum()
    p[lt1k] = probs[0] / n_lt1k
    p[~lt1k] = probs[1] / (n_elem - n_lt1k)
    return np.random.choice(a, size=(n,), p=p)
def choices_hc(splits, probs, n):
    assert splits[0] == 0
    a = np.arange(splits[-1])
    p = np.zeros_like(a, dtype=float)
    p[:splits[1]] = probs[0] / splits[1]
    p[splits[1]:] = probs[1] / (splits[2] - splits[1])
    return np.random.choice(a, size=(n,), p=p)
#%%
if __name__ == "__main__":
    #%% Set up sim
    # sizes = [5, 10, 50, 100, 500, 1000, 5000, 10_000, 50_000, 100_000]
    sizes = [5, 10, 50, 100, 500, 1000, 5000, 10_000, 50_000, 100_000, 1_000_000, 5_000_000, 10_000_000]
    funcs = [mixture, choices, choices_hc]

    #%% Run timing
    time_fcalls = time_funcs(funcs, sizes, arg_gen)
    fig, ax = plot_times(time_fcalls, funcs)
    ax.set_xlabel("max(splits) - min(splits)")
    plt.show()
An alternative would be to treat this as a mixture of two distributions: one that draws uniformly from {0..999} with probability = 0.4, and another that draws uniformly from {1000..65535} with probability = 0.6.
Using `choice` for the mixture component makes sense, but then I’d use something else to draw the values, because when probabilities are passed to `choice` it does O(len(p)) work on every call to transform them. `Generator.integers` should be more efficient, as it can sample your uniform values directly.
Putting this together, I’d suggest using something like:
import numpy as np
rng = np.random.default_rng()
n = 1000
splits = np.array([0, 1000, 65536])
# draw weighted mixture components
s = rng.choice(2, n, p=[0.4, 0.6])
# draw uniform values according to component
result = rng.integers(splits[s], splits[s+1])
You can verify this is drawing from the correct distribution by evaluating `np.mean(result < 1000)` and checking that it’s close to 0.4. The variance of that estimate is approximately 0.4*0.6 / n, so for n=1000 (standard deviation ≈ 0.0155), values in [0.37, 0.43] should be seen about 95% of the time. This method should remain fast as `max(splits) - min(splits)` gets larger, whereas Pranav’s solution of using `choice` directly will slow down.
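For what it’s worth, the same mixture idea extends to more than two intervals. Here is a sketch of my own (the three-way split points and probabilities below are hypothetical, not from the question): draw one component index per sample, then draw uniformly within that component’s interval.

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical three-way split, just to illustrate the generalization:
# [0, 1000) w.p. 0.2, [1000, 30000) w.p. 0.3, [30000, 65536) w.p. 0.5
splits = np.array([0, 1000, 30000, 65536])
probs = np.array([0.2, 0.3, 0.5])

n = 10_000
s = rng.choice(len(probs), size=n, p=probs)      # one component index per sample
result = rng.integers(splits[s], splits[s + 1])  # uniform draw within each interval

# empirical component frequencies should approximate probs
emp = np.array([np.mean((result >= splits[k]) & (result < splits[k + 1]))
                for k in range(len(probs))])
print(emp)
```

`Generator.integers` accepts array-valued low/high bounds and broadcasts them elementwise, which is what makes the per-sample interval draw a single vectorized call.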