Data generated from SciPy truncnorm.rvs does not match the specified standard deviation

Question:

I am trying to generate data that follow a specified truncated normal distribution. Based on answers here and here, I wrote:

import scipy.stats

lower,upper,mu,sigma,N = 5,15,10,5,10000
samples = scipy.stats.truncnorm.rvs((lower-mu)/sigma,(upper-mu)/sigma,loc=mu,scale=sigma,size=N)
samples.std()

I get output like

> 2.673

Which is obviously nowhere close to the expected value of 5. Repeating it does not change it considerably, so it’s not a sample-size issue. Any suggestions?

Asked By: Martan


Answers:

This generates a normal distribution truncated to [5,15], which is mu +/- 1 s.d. The scale argument is the s.d. of the underlying, untruncated normal, so the s.d. measured across this sample will not be equal to the input.

If you clip the range of outputs, you necessarily reduce the s.d. observed.

As lower/upper -> +/-infinity, the sample std -> 5.
As lower/upper -> 10, the sample std -> 0.
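
The two limits above can be checked numerically. This is a minimal sketch, assuming symmetric bounds around mu; the helper name truncated_std and the chosen half-widths are made up for illustration, while truncnorm.std is SciPy's own method for the theoretical standard deviation.

```python
from scipy import stats

mu, sigma = 10, 5  # same parameters as in the question


def truncated_std(half_width):
    """Theoretical s.d. of N(mu, sigma) truncated to [mu - half_width, mu + half_width].

    Hypothetical helper, not part of SciPy.
    """
    a = -half_width / sigma  # standardized lower bound
    b = half_width / sigma   # standardized upper bound
    return float(stats.truncnorm.std(a, b, loc=mu, scale=sigma))


# narrow interval -> s.d. shrinks towards 0; wide interval -> s.d. approaches sigma
for half_width in (0.5, 5, 15, 50):
    print(half_width, round(truncated_std(half_width), 4))
```

The half-width of 5 corresponds to the [5,15] interval from the question.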

Answered By: user157545

Indeed, truncating the normal distribution reduces the variability (and thereby the standard deviation) of the possible realizations of the random variable. So we know why it is not 5.0, but we don’t yet know why it should be 2.673, other than that it is smaller.

What if we compute the exact standard deviation of the truncated normal distribution and compare it to the empirical value you retrieved? If the two agree, you can be sure that everything checks out.

from scipy import stats
from scipy.integrate import quad
import numpy as np
from matplotlib import pyplot as plt

# parameters from the question
lower, upper, mu, sigma = 5, 15, 10, 5

# prob. of the untruncated normal dist. on [lower, upper]; its inverse is the re-normalization constant
p = stats.norm.cdf(upper, loc=mu, scale=sigma) - stats.norm.cdf(lower, loc=mu, scale=sigma)

# plot the truncated normal density
x_axis = np.linspace(0, 25, 10000)
plt.title('Truncated Normal Density', fontsize=18)
plt.plot(x_axis, stats.truncnorm.pdf(x_axis, (lower-mu)/sigma, (upper-mu)/sigma, loc=mu, scale=sigma))

plt.show()

[Figure: plot titled "Truncated Normal Density"]

The plot shows the truncated normal density and hints that the narrower the interval [lower, upper] is chosen, the smaller the standard deviation will be (approaching 0 as lower and upper get arbitrarily close).

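Let’s make this rigorous to really be sure. Given the age-old equations for the expected value and variance of our truncated normal random variable X with density f, we have

$$\mathbb{E}[X] = \int_{\text{lower}}^{\text{upper}} x \, f(x) \, dx, \qquad \operatorname{Var}(X) = \int_{\text{lower}}^{\text{upper}} \left(x - \mathbb{E}[X]\right)^2 f(x) \, dx$$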

Then, defining the helper functions

def xfx(x, lower=lower, upper=upper, mu=mu, sigma=sigma):
    '''helper function returning x*f(x) for the truncated normal density f'''
    return x*stats.truncnorm.pdf(x, (lower-mu)/sigma, (upper-mu)/sigma, loc=mu, scale=sigma)

def x_EX_fx(x, lower=lower, upper=upper, mu=mu, sigma=sigma):
    '''helper function returning (x - E[X])**2 * f(x) for the truncated normal density f'''
    # E[X] is recomputed on every call; pass the parameters through so they stay consistent
    EX = quad(func=xfx, a=lower, b=upper, args=(lower, upper, mu, sigma))[0]
    return ((x - EX)**2) * stats.truncnorm.pdf(x, (lower-mu)/sigma, (upper-mu)/sigma, loc=mu, scale=sigma)

allows us to compute the exact values (up to numerical integration error)

# E[X], expected value of X
quad(func=xfx,a=lower,b=upper)[0]
> 10.0

# (Var(X))^(1/2), standard deviation of X
np.sqrt(quad(func=x_EX_fx,a=lower,b=upper)[0])
> 2.697

This looks eerily similar to your observed value 2.673. Let’s see if the difference is merely based on the finite sample size by running a simulation study to observe if the empirical standard deviation approaches the theoretical one.

# simulation study: sample sizes 10**2, ..., 10**7
np.random.seed(7447)
exponents = range(2, 8)
stdList = [stats.truncnorm.rvs((lower-mu)/sigma, (upper-mu)/sigma, loc=mu, scale=sigma, size=10**k).std() for k in exponents]

# plot empirical s.d. against the theoretical value
plt.title(r"Convergence behaviour of $\hat{\sigma}_{n}$ to $\sigma$", fontsize=18)
plt.plot(exponents, stdList, color='blue', label='empirical')
plt.axhline(2.697800468774485, color='red', lw=0.85, label='theoretical')
plt.legend(fontsize=14)
plt.xlabel(r"$\log_{10}(N)$", fontsize=14)
plt.show()

yielding

[Figure: empirical standard deviation against log10(N), with the theoretical value as a horizontal red line]

This confirms that your output is sound.
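
As an additional cross-check, the integral does not have to be evaluated numerically. The sketch below uses the standard closed-form variance of a truncated normal and compares it with the value SciPy reports for the frozen distribution. The variable names (alpha, beta, Z, std_closed, std_scipy) are my own choices for illustration; the parameter values are the ones from the question.

```python
import numpy as np
from scipy import stats

lower, upper, mu, sigma = 5, 15, 10, 5  # values from the question

# standardized truncation bounds
alpha, beta = (lower - mu) / sigma, (upper - mu) / sigma

# closed-form variance of a normal truncated to [lower, upper]
Z = stats.norm.cdf(beta) - stats.norm.cdf(alpha)           # prob. mass kept
phi_a, phi_b = stats.norm.pdf(alpha), stats.norm.pdf(beta)  # standard normal pdf at the bounds
var_closed = sigma**2 * (1 + (alpha * phi_a - beta * phi_b) / Z - ((phi_a - phi_b) / Z) ** 2)
std_closed = float(np.sqrt(var_closed))

# the same quantity as reported by SciPy's frozen truncnorm distribution
std_scipy = float(stats.truncnorm(alpha, beta, loc=mu, scale=sigma).std())

print(std_closed, std_scipy)
```

Both numbers should agree with the value obtained from quad above, roughly 2.698.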

Answered By: 7shoe