# Fast arbitrary distribution random sampling (inverse transform sampling)

## Question:

The `random` module (http://docs.python.org/2/library/random.html) has several fixed functions to randomly sample from. For example `random.gauss` will sample random point from a normal distribution with a given mean and sigma values.

I’m looking for a way to extract a number `N` of random samples between a given interval using my own distribution as fast as possible in `python`. This is what I mean:

``````def my_dist(x):
# Some distribution, assume c1,c2,c3 and c4 are known.
f = c1*exp(-((x-c2)**c3)/c4)
return f

# Draw N random samples from my distribution between given limits a,b.
N = 1000
N_rand_samples = ran_func_sample(my_dist, a, b, N)
``````

where `ran_func_sample` is what I’m after and `a, b` are the limits from which to draw the samples. Is there anything of that sort in `python`?

You need to use Inverse transform sampling method to get random values distributed according to a law you want. Using this method you can just apply inverted function
to random numbers having standard uniform distribution in the interval [0,1].

After you find the inverted function, you get 1000 numbers distributed according to the needed distribution this obvious way:

``````[inverted_function(random.random()) for x in range(1000)]
``````

More on Inverse Transform Sampling:

Also, there is a good question on StackOverflow related to the topic:

This code implements the sampling of n-d discrete probability distributions. By setting a flag on the object, it can also be made to be used as a piecewise constant probability distribution, which can then be used to approximate arbitrary pdf’s. Well, arbitrary pdfs with compact support; if you efficiently want to sample extremely long tails, a non-uniform description of the pdf would be required. But this is still efficient even for things like airy-point-spread functions (which I created it for, initially). The internal sorting of values is absolutely critical there to get accuracy; the many small values in the tails should contribute substantially, but they will get drowned out in fp accuracy without sorting.

``````class Distribution(object):
"""
draws samples from a one dimensional probability distribution,
by means of inversion of a discrete inverstion of a cumulative density function

the pdf can be sorted first to prevent numerical error in the cumulative sum
this is set as default; for big density functions with high contrast,
it is absolutely necessary, and for small density functions,

a call to this distibution object returns indices into density array
"""
def __init__(self, pdf, sort = True, interpolation = True, transform = lambda x: x):
self.shape          = pdf.shape
self.pdf            = pdf.ravel()
self.sort           = sort
self.interpolation  = interpolation
self.transform      = transform

#a pdf can not be negative
assert(np.all(pdf>=0))

#sort the pdf by magnitude
if self.sort:
self.sortindex = np.argsort(self.pdf, axis=None)
self.pdf = self.pdf[self.sortindex]
#construct the cumulative distribution function
self.cdf = np.cumsum(self.pdf)
@property
def ndim(self):
return len(self.shape)
@property
def sum(self):
"""cached sum of all pdf values; the pdf need not sum to one, and is imlpicitly normalized"""
return self.cdf[-1]
def __call__(self, N):
"""draw """
#pick numbers which are uniformly random over the cumulative distribution function
choice = np.random.uniform(high = self.sum, size = N)
#find the indices corresponding to this point on the CDF
index = np.searchsorted(self.cdf, choice)
#if necessary, map the indices back to their original ordering
if self.sort:
index = self.sortindex[index]
#map back to multi-dimensional indexing
index = np.unravel_index(index, self.shape)
index = np.vstack(index)
#is this a discrete or piecewise continuous distribution?
if self.interpolation:
index = index + np.random.uniform(size=index.shape)
return self.transform(index)

if __name__=='__main__':
shape = 3,3
pdf = np.ones(shape)
pdf[1]=0
dist = Distribution(pdf, transform=lambda i:i-1.5)
print dist(10)
import matplotlib.pyplot as pp
pp.scatter(*dist(1000))
pp.show()
``````

And as a more real-world relevant example:

``````x = np.linspace(-100, 100, 512)
p = np.exp(-x**2)
pdf = p[:,None]*p[None,:]     #2d gaussian
dist = Distribution(pdf, transform=lambda i:i-256)
print dist(1000000).mean(axis=1)    #should be in the 1/sqrt(1e6) range
import matplotlib.pyplot as pp
pp.scatter(*dist(1000))
pp.show()
``````
``````import numpy as np
import scipy.interpolate as interpolate

def inverse_transform_sampling(data, n_bins, n_samples):
hist, bin_edges = np.histogram(data, bins=n_bins, density=True)
cum_values = np.zeros(bin_edges.shape)
cum_values[1:] = np.cumsum(hist*np.diff(bin_edges))
inv_cdf = interpolate.interp1d(cum_values, bin_edges)
r = np.random.rand(n_samples)
return inv_cdf(r)
``````

So if we give our data sample that has a specific distribution, the `inverse_transform_sampling` function will return a dataset with exactly the same distribution. Here the advantage is that we can get our own sample size by specifying it in the `n_samples` variable.

I was in a similar situation but I wanted to sample from a multivariate distribution, so, I implemented a rudimentary version of Metropolis-Hastings (which is an MCMC method).

``````def metropolis_hastings(target_density, size=500000):
burnin_size = 10000
size += burnin_size
x0 = np.array([[0, 0]])
xt = x0
samples = []
for i in range(size):
xt_candidate = np.array([np.random.multivariate_normal(xt[0], np.eye(2))])
accept_prob = (target_density(xt_candidate))/(target_density(xt))
if np.random.uniform(0, 1) < accept_prob:
xt = xt_candidate
samples.append(xt)
samples = np.array(samples[burnin_size:])
samples = np.reshape(samples, [samples.shape[0], 2])
return samples
``````

This function requires a function `target_density` which takes in a data-point and computes its probability.

For details check-out this detailed answer of mine.

Here is a rather nice way of performing inverse transform sampling with a decorator.

``````import numpy as np
from scipy.interpolate import interp1d

def inverse_sample_decorator(dist):

def wrapper(pnts, x_min=-100, x_max=100, n=1e5, **kwargs):

x = np.linspace(x_min, x_max, int(n))
cumulative = np.cumsum(dist(x, **kwargs))
cumulative -= cumulative.min()
f = interp1d(cumulative/cumulative.max(), x)
return f(np.random.random(pnts))

return wrapper
``````

Using this decorator on a Gaussian distribution, for example:

``````@inverse_sample_decorator
def gauss(x, amp=1.0, mean=0.0, std=0.2):
return amp*np.exp(-(x-mean)**2/std**2/2.0)
``````

You can then generate sample points from the distribution by calling the function. The keyword arguments `x_min` and `x_max` are the limits of the original distribution and can be passed as arguments to `gauss` along with the other key word arguments that parameterise the distribution.

``````samples = gauss(5000, mean=20, std=0.8, x_min=19, x_max=21)
``````

Alternatively, this can be done as a function that takes the distribution as an argument (as in your original question),

``````def inverse_sample_function(dist, pnts, x_min=-100, x_max=100, n=1e5,
**kwargs):

x = np.linspace(x_min, x_max, int(n))
cumulative = np.cumsum(dist(x, **kwargs))
cumulative -= cumulative.min()
f = interp1d(cumulative/cumulative.max(), x)

return f(np.random.random(pnts))
``````

I’m using numpy.random.choice to simulate coinflips based on any pdf, which is arguably not very fast but it is very simple.

``````import numpy as np
rng = np.random.default_rng(seed=43)

def gaussian(x, mu, sig):
return 1 / (np.sqrt(2 * np.pi) * sig) * np.exp(-np.power((x - mu) / sig, 2) / 2)

x = np.linspace(-4, 4, num=250, endpoint=True)
y = gaussian(x, 0, 1.)
``````

Then normalize to convert to pdf and drawing samples:

``````pdf = y / y.sum()
N = 1000
samples = np.array([rng.choice(x, p=pdf) for i in range(N)])
``````

The timing for my use case is acceptable:

``````%timeit samples = np.array([rng.choice(x, p=pdf) for i in range(N)])
>>> 25.3 ms ± 915 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
``````

Inspired by @Eelco I tried the cdf approach:

``````cdf = np.cumsum(pdf)
x_sample = rng.uniform(0, 1, size=1000)
indices = np.searchsorted(cdf, x_sample)
cdf_samples = x[indices]
``````

This is much faster:

``````%timeit x_sample = rng.uniform(0, 1, size=1000)
# 7.08 µs ± 179 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

%timeit indices = np.searchsorted(cdf, x_sample)
# 37.9 µs ± 933 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit cdf_samples = x[indices]
1.45 µs ± 17 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
``````

In total about 50 µs. This approach is probably a bit more restrictive compared to what the `Distribution` class offers such as the requirement of having an actual pdf which sums to one, but I think that is a good constraint to impose anyway.

If you want more samples a meaningful histogram with `'fd'` binning, you also need to increase the sampling of the pdf (ie the `np.linspace` size).

Plotting the result:

``````
import proplot as pplt

fig, ax = pplt.subplots()
ax.hist(samples, bins='fd', density=True, histtype='step', lw=2, label='rng.choice')
ax.hist(rng.normal(0, 1, N), bins='fd', density=True, histtype='step', lw=2, label='rng.normal')
ax.hist(cdf_samples,  bins='fd', density=True, histtype='step', lw=2, label='rng.uniform')
ax.plot(x, y, color='k', label='pdf')
ax.format(xlabel='x', ylabel='y')
ax.legend(loc='t', ncols=2)

pplt.show()

``````

Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.