How to apply numpy random.choice to a matrix of probability values (Vectorized solution)
Question:
The problem I have is as follows
I have a 1-D list of integers (or np.array) with 3 values
l = [0,1,2]
I have a 2-D list of probabilities (for simplicity, we’ll use two rows)
P = [[0.8, 0.1, 0.1],
     [0.3, 0.3, 0.4]]
What I want is numpy.random.choice(a=l, p=P), where each row in P (a probability distribution) is applied to l. So, I want a random sample to be drawn from [0,1,2] with prob. dist. [0.8, 0.1, 0.1] first, then with prob. dist. [0.3, 0.3, 0.4] next, to give me two outputs.
===== Update ======
I can use for loops or list comprehension, but I am looking for a fast/vectorized solution.
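For reference, the loop-based version mentioned above might look like the following sketch. It assumes NumPy's Generator API (np.random.default_rng), which the question itself does not use, and it is only a baseline to compare a vectorized solution against.

```python
import numpy as np

l = [0, 1, 2]
P = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.3, 0.4]])

# Assumption: an unseeded Generator, so results differ between runs.
rng = np.random.default_rng()

# One draw per row of P: each row is passed as the `p` argument in turn.
samples = np.array([rng.choice(l, p=row) for row in P])
print(samples)
```

This makes one Python-level call per row, which is what becomes slow when P has many rows.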
Answers:
Here’s one way.
Here’s the array of probabilities:
In [161]: p
Out[161]:
array([[ 0.8 ,  0.1 ,  0.1 ],
       [ 0.3 ,  0.3 ,  0.4 ],
       [ 0.25,  0.5 ,  0.25]])
c holds the cumulative distributions:
In [162]: c = p.cumsum(axis=1)
Generate a set of uniformly distributed samples…
In [163]: u = np.random.rand(len(c), 1)
…and then see where they “fit” in c:
In [164]: choices = (u < c).argmax(axis=1)
In [165]: choices
Out[165]: array([1, 2, 2])
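The steps above can be wrapped into a small function that also maps the chosen column indices back to the values in l. This is a sketch, not part of the original answer: the name sample_rows is hypothetical, it uses np.random.default_rng instead of np.random.rand, and it adds one safeguard the transcript does not have (forcing the last cumulative value to exactly 1.0, so that floating-point rounding in cumsum cannot leave a row with no match, in which case argmax would silently return 0).

```python
import numpy as np

def sample_rows(values, probs, rng=None):
    """Draw one element of `values` per row of `probs` (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values)
    probs = np.asarray(probs, dtype=float)

    # Cumulative distribution of each row.
    c = probs.cumsum(axis=1)
    # Added safeguard (not in the original answer): rounding can leave the
    # last entry slightly below 1, so pin it to 1.0.
    c[:, -1] = 1.0

    # One uniform sample in [0, 1) per row, shaped (n_rows, 1) for broadcasting.
    u = rng.random((len(c), 1))

    # Index of the first cumulative value that exceeds the uniform sample.
    idx = (u < c).argmax(axis=1)
    return values[idx]

p = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.3, 0.4],
              [0.25, 0.5, 0.25]])
print(sample_rows([0, 1, 2], p))
```

Because everything is done with array operations, the cost per row stays small even when p has many rows.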
This question is quite old, but there might be a slightly more elegant solution based on this:
https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.multinomial.html
(I adapted the original input to work as a DataFrame).
import numpy as np
import pandas as pd

# Define the list of choices
choices = ["a", "b", "c"]
# Define the DataFrame of probability distributions
# (In each row, the probabilities of a, b and c can be different)
df_probabilities = pd.DataFrame(data=[[0.8, 0.1, 0.1],
                                      [0.3, 0.3, 0.4]],
                                columns=choices)
print(df_probabilities)
     a    b    c
0  0.8  0.1  0.1
1  0.3  0.3  0.4
# Generate a DataFrame of selections. In each row, a 1 denotes
# which choice was selected
rng = np.random.default_rng(42)
df_selections = pd.DataFrame(
    data=rng.multinomial(n=1, pvals=df_probabilities),
    columns=choices)
print(df_selections)
   a  b  c
0  1  0  0
1  0  1  0
# Finally, reduce the DataFrame to one column (actually pd.Series)
# with the selected choice
df_result = df_selections.idxmax(axis=1)
print(df_result)
0    a
1    b
dtype: object
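The same idea also works without pandas, applied directly to the l and P from the question. The sketch below assumes a NumPy version in which Generator.multinomial accepts a 2-D pvals array (the NumPy documentation notes that broadcasting pvals against n was added in 1.22); the variable names one_hot and samples are illustrative only.

```python
import numpy as np

l = np.array([0, 1, 2])
P = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.3, 0.4]])

rng = np.random.default_rng(42)

# One multinomial trial per row of P: each output row contains a single 1
# in the column that was selected.
one_hot = rng.multinomial(n=1, pvals=P)

# Convert the one-hot rows to column indices, then to the values in l.
samples = l[one_hot.argmax(axis=1)]
print(samples)
```

Here argmax plays the role that idxmax plays in the DataFrame version.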