Generate sample data with an exact Mean and Standard Deviation

Question:

I want to create a data set with a specific mean and standard deviation.

Using np.random.normal() only gives me approximate values. However, for what I want to test, I need an exact mean and standard deviation.

I have tried using a combination of norm.pdf and np.linspace, but the generated data set doesn't match either (though I may just be misusing them).

It really doesn't matter whether the data set is random or not, as long as I can set a specific sample size, mean and standard deviation.

Help would be much appreciated.

Asked By: Oliver Brace


Answers:

The easiest approach is to generate some zero-mean samples with the desired standard deviation, subtract the sample mean so that the mean is exactly zero, scale the samples so that the standard deviation is exactly right, and then add the desired mean.

Here is some example code:

import numpy as np

num_samples = 1000
desired_mean = 50.0
desired_std_dev = 10.0

# Draw zero-mean samples; their mean and std are only approximately right.
samples = np.random.normal(loc=0.0, scale=desired_std_dev, size=num_samples)

actual_mean = np.mean(samples)
actual_std = np.std(samples)  # np.std defaults to the population std (ddof=0)
print("Initial samples stats   : mean = {:.4f} stdv = {:.4f}".format(actual_mean, actual_std))

# Subtract the sample mean so the mean becomes exactly zero.
zero_mean_samples = samples - actual_mean

zero_mean_mean = np.mean(zero_mean_samples)
zero_mean_std = np.std(zero_mean_samples)
print("True zero samples stats : mean = {:.4f} stdv = {:.4f}".format(zero_mean_mean, zero_mean_std))

# Rescale so the standard deviation matches the target.
scaled_samples = zero_mean_samples * (desired_std_dev / zero_mean_std)
scaled_mean = np.mean(scaled_samples)
scaled_std = np.std(scaled_samples)
print("Scaled samples stats    : mean = {:.4f} stdv = {:.4f}".format(scaled_mean, scaled_std))

# Shift to the desired mean; adding a constant leaves the std unchanged.
final_samples = scaled_samples + desired_mean
final_mean = np.mean(final_samples)
final_std = np.std(final_samples)
print("Final samples stats     : mean = {:.4f} stdv = {:.4f}".format(final_mean, final_std))

Which produces output similar to this:

Initial samples stats   : mean = 0.2946 stdv = 10.1609
True zero samples stats : mean = 0.0000 stdv = 10.1609
Scaled samples stats    : mean = 0.0000 stdv = 10.0000
Final samples stats     : mean = 50.0000 stdv = 10.0000
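
The same shift-and-scale steps can be wrapped in a small helper. This is only a sketch: the function name exact_samples and its ddof and rng parameters are made up for illustration and are not part of NumPy. The ddof argument is passed through to the array's std method, so you can choose whether the population (ddof=0) or the sample (ddof=1) standard deviation should hit the target.

```python
import numpy as np


def exact_samples(n, mean, std, ddof=0, rng=None):
    """Return n values whose mean and std match the targets up to rounding.

    Hypothetical helper for illustration; needs n >= 2 so the std is defined.
    """
    if n < 2:
        raise ValueError("need at least two samples")
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(n)
    # Standardise: exactly zero mean and unit std for the chosen ddof.
    x = (x - x.mean()) / x.std(ddof=ddof)
    # Stretch to the target std, then shift to the target mean.
    return mean + std * x


data = exact_samples(1000, 50.0, 10.0, rng=np.random.default_rng(0))
print(data.mean(), data.std())
```

The match is exact only up to floating-point rounding, so compare with np.isclose rather than ==.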
Answered By: Spoonless

For others seeing this later, Python 3.8+ has the statistics.NormalDist class for exactly this purpose:

import statistics as s
n = s.NormalDist(mu=10, sigma=2)
samples = n.samples(100_000, seed=42)  # remove seed if desired
print(s.mean(samples))  # 10.004521585462394
print(s.stdev(samples))  # 2.0052615406360457

Methods from @Spoonless's answer can be used to tweak the exact mean and stdev of the samples if needed, or one can just use a large enough number of samples to get exceedingly close; this is statistics, after all.
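
As a rough sketch of combining the two ideas with only the standard library, the samples from NormalDist can be standardised and then rescaled. The helper name exact_normal_samples is hypothetical. Note that statistics.stdev is the sample standard deviation (n - 1 in the denominator); swap in statistics.pstdev if the population value is the one that must match.

```python
import statistics as s


def exact_normal_samples(n, mu, sigma, seed=None):
    """Hypothetical helper: normal-looking samples with exact mean and stdev."""
    raw = s.NormalDist(0, 1).samples(n, seed=seed)
    m = s.fmean(raw)
    sd = s.stdev(raw, xbar=m)  # sample standard deviation of the raw draw
    # Standardise each value, then scale and shift to the targets.
    return [mu + sigma * (x - m) / sd for x in raw]


samples = exact_normal_samples(1000, mu=10, sigma=2, seed=42)
print(s.mean(samples))
print(s.stdev(samples))
```

The printed values should agree with 10 and 2 up to floating-point rounding.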

Answered By: Brendano257

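You can also generate normally distributed data with the standard-library random module, although the mean and standard deviation will again only be approximate.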

import random as rand
mean = 20.9
stdd = 3
samples = 1000
data = [rand.normalvariate(mean, stdd) for i in range(samples)]

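I also needed to generate data with residuals, so I simply added the product of a random factor and the residual. Note that rand.randrange(-1, 1) only ever returns the integers -1 or 0, so rand.uniform(-1, 1) is used here to get a factor that is symmetric around zero.

# residual is the maximum size of the added noise and must be defined beforehand
data = [rand.normalvariate(mean, stdd) + rand.uniform(-1, 1) * residual for i in range(samples)]

Note that adding residuals will throw off the mean and standard deviation slightly.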
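
To get a feel for how large that shift is, here is a small sketch. It assumes the residual factor is drawn uniformly from [-1, 1], which is an assumption about the intent here, and the residual value of 0.5 is purely illustrative. A uniform factor on [-1, 1] has mean 0 and variance 1/3, so independent noise of this kind leaves the expected mean unchanged and raises the variance by residual**2 / 3.

```python
import math
import random
import statistics

rng = random.Random(0)  # seeded only so the run is repeatable
mean, stdd = 20.9, 3.0
residual = 0.5          # illustrative value, not from the original answer
n = 100_000

noisy = [rng.normalvariate(mean, stdd) + rng.uniform(-1, 1) * residual
         for _ in range(n)]

# Variances of independent terms add: stdd**2 + residual**2 / 3.
expected_std = math.sqrt(stdd**2 + residual**2 / 3)

observed_mean = statistics.fmean(noisy)
observed_std = statistics.pstdev(noisy)
print(observed_mean, observed_std, expected_std)
```

If the exact mean and standard deviation still matter after adding residuals, the subtract-mean-and-rescale step has to be applied last, to the noisy data.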
