Given a 2D Numpy array representing a 2D distribution, how to sample data from this distribution with the aid of Numpy or Scipy functions?
Question:
Given a 2D numpy array dist
with shape (200,200)
, where each entry of the array represents the joint probability of (x1, x2) for all x1 , x2 ∈ {0, 1, . . . , 199}. How do I sample bivariate data x = (x1, x2) from this probability distribution with the aid of Numpy or Scipy API?
Answers:
Here’s a way, but I’m sure there’s a much more elegant solution using scipy.
numpy.random
doesn’t deal with 2d pmfs, so you have to do some reshaping gymnastics to go this way.
import numpy as np
# construct a toy joint pmf
dist=np.random.random(size=(200,200)) # here's your joint pmf
dist/=dist.sum() # it has to be normalized
# generate the set of all x,y pairs represented by the pmf
pairs=np.indices(dimensions=(200,200)).T # here are all of the x,y pairs
# make n random selections from the flattened pmf without replacement
# whether you want replacement depends on your application
n=50
inds=np.random.choice(np.arange(200**2),p=dist.reshape(-1),size=n,replace=False)
# inds is the set of n randomly chosen indicies into the flattened dist array...
# therefore the random x,y selections
# come from selecting the associated elements
# from the flattened pairs array
selections = pairs.reshape(-1,2)[inds]
I can’t comment, but to improve kevinkayaks answer’s :
pairs=np.indices(dimensions=(200,200)).T
selections = pairs.reshape(-1,2)[inds]
Is not needed can be replace by :
np.array([inds//m, inds%m]).T
The matrix “pairs” is not needed anymore.
This solution works with probability distributions of any number of dimensions, assuming they are a valid probability distribution (its contents must sum to 1, etc.). It flattens the distribution, samples from that, and adjusts the random index to match the original array shape.
# Create a flat copy of the array
flat = array.flatten()
# Then, sample an index from the 1D array with the
# probability distribution from the original array
sample_index = np.random.choice(a=flat.size, p=flat)
# Take this index and adjust it so it matches the original array
adjusted_index = np.unravel_index(sample_index, array.shape)
print(adjusted_index)
Also, to get multiple samples, add a size
keyword argument to the np.random.choice
call, and modify adjusted_index
before printing it:
adjusted_index = np.array(zip(*adjusted_index))
This is necessary because np.random.choice
with a size
argument outputs a list of indices for each coordinate dimension, so this zips them into a list of coordinate tuples. This is also much more efficient than simply repeating the first code.
Relevant documentation:
I can’t comment either, but @applemonkey496 ‘s suggestion for getting multiple samples doesn’t work as written. It’s an excellent solution otherwise.
Instead of
adjusted_index = np.array(zip(*adjusted_index))
adjusted_index should be converted to a python list before trying to put it into a numpy array (numpy arrays do not accept zipped objects), eg:
adjusted_index = np.array(list(zip(*adjusted_index)))
Given a 2D numpy array dist
with shape (200,200)
, where each entry of the array represents the joint probability of (x1, x2) for all x1 , x2 ∈ {0, 1, . . . , 199}. How do I sample bivariate data x = (x1, x2) from this probability distribution with the aid of Numpy or Scipy API?
Here’s a way, but I’m sure there’s a much more elegant solution using scipy.
numpy.random
doesn’t deal with 2d pmfs, so you have to do some reshaping gymnastics to go this way.
import numpy as np
# construct a toy joint pmf
dist=np.random.random(size=(200,200)) # here's your joint pmf
dist/=dist.sum() # it has to be normalized
# generate the set of all x,y pairs represented by the pmf
pairs=np.indices(dimensions=(200,200)).T # here are all of the x,y pairs
# make n random selections from the flattened pmf without replacement
# whether you want replacement depends on your application
n=50
inds=np.random.choice(np.arange(200**2),p=dist.reshape(-1),size=n,replace=False)
# inds is the set of n randomly chosen indicies into the flattened dist array...
# therefore the random x,y selections
# come from selecting the associated elements
# from the flattened pairs array
selections = pairs.reshape(-1,2)[inds]
I can’t comment, but to improve kevinkayaks answer’s :
pairs=np.indices(dimensions=(200,200)).T
selections = pairs.reshape(-1,2)[inds]
Is not needed can be replace by :
np.array([inds//m, inds%m]).T
The matrix “pairs” is not needed anymore.
This solution works with probability distributions of any number of dimensions, assuming they are a valid probability distribution (its contents must sum to 1, etc.). It flattens the distribution, samples from that, and adjusts the random index to match the original array shape.
# Create a flat copy of the array
flat = array.flatten()
# Then, sample an index from the 1D array with the
# probability distribution from the original array
sample_index = np.random.choice(a=flat.size, p=flat)
# Take this index and adjust it so it matches the original array
adjusted_index = np.unravel_index(sample_index, array.shape)
print(adjusted_index)
Also, to get multiple samples, add a size
keyword argument to the np.random.choice
call, and modify adjusted_index
before printing it:
adjusted_index = np.array(zip(*adjusted_index))
This is necessary because np.random.choice
with a size
argument outputs a list of indices for each coordinate dimension, so this zips them into a list of coordinate tuples. This is also much more efficient than simply repeating the first code.
Relevant documentation:
I can’t comment either, but @applemonkey496 ‘s suggestion for getting multiple samples doesn’t work as written. It’s an excellent solution otherwise.
Instead of
adjusted_index = np.array(zip(*adjusted_index))
adjusted_index should be converted to a python list before trying to put it into a numpy array (numpy arrays do not accept zipped objects), eg:
adjusted_index = np.array(list(zip(*adjusted_index)))