How to Generate a dataset based on mean, median, 1st & 9th decile values?

Question:

I have the following values that describe a dataset:

Number of Samples: 5388
Mean: 4173
Median: 4072
1st Decile: 2720
9th Decile: 5676

I need to generate any dataset that fits these values.
All the examples I found require the standard deviation, which I don't have.
How can this be done?
Thanks!

Asked By: user14070683


Answers:

The median fixes the two middle values of the sorted data: with 5388 samples these are the 5388/2 = 2694th and 5388/2 + 1 = 2695th values. So, let's just make those both 4072.

The 1st and 9th deciles fix the values around positions 5388/10 = 538.8 and 9*5388/10 = 4849.2. There are multiple formulae in vogue for deciles, so it is safest to pin every value a common formula might touch. Setting the 538th, 539th and 540th values to 2720 covers the 1st decile (NumPy's default linear interpolation, for instance, uses the 539th and 540th values here). We can similarly obtain the correct 9th decile by fixing the 4849th and 4850th values to 5676.

The mean removes one more degree of freedom, but computing the mean involves the actual values from the entire dataset, so we'll put it off till later.

First, what we need to do is to make the 537 lowest values no larger than 2720. It (almost) doesn't matter how, but it might be good to make them quite low (to be explained later). Then, we need to make 2693-540 = 2153 values (the number of values between our fixed first decile values and the fixed median values) between 2720 (the first decile) and 4072 (the median). Then make 4848-2695 = 2153 values between 4072 and 5676.

We now need 5388-4850 = 538 values (the total number of values minus the 9th decile and lower values) of at least 5676, but recall that we also need to set the mean. There are infinitely many ways to do this, but one way is to simply make all of the values above the 9th decile identical. To do this, we can compute the mean of the lower 4850 values (which we already have), and realize that (5388 - 4850) * X + 4850 * M = 5388 * 4173, where M is the mean of the lower 4850 values. Solve for X to obtain the value that you need. Since X must be at least 5676, it is helpful if you set the values below the first decile to be small, because this gives us some leeway.

Another way to do this is to pick random numbers above 5676 for all but one of these values, then pick the last value in such a way as to fix the mean (again, it would be wise to pick the random values to not be much above 5676, since the last remaining value can be made arbitrarily large to drag the mean up to the correct value).

In any case, in numpy, you’ll just be generating a bunch of random numbers. np.random.randint should get the job done.
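
The construction above can be sketched roughly as follows. This is only an illustration under a few assumptions: the function name build_dataset, the lower bound low=2000 and the seed are made up for the example, the values are floats rather than integers from np.random.randint, and the pinned positions are the ones that np.quantile's default linear interpolation reads, so a different decile formula may need different positions.

```python
import numpy as np

def build_dataset(n, mean, median, d1, d9, low, seed=0):
    """Build n values matching the given mean, median and 1st/9th deciles."""
    rng = np.random.default_rng(seed)

    def pinned(q):
        # 0-based sorted positions that np.quantile's default (linear)
        # method interpolates between for quantile q.
        pos = q * (n - 1)
        return int(np.floor(pos)), int(np.ceil(pos))

    (a1, b1), (a5, b5), (a9, b9) = pinned(0.1), pinned(0.5), pinned(0.9)

    data = np.empty(n)
    data[:a1] = rng.uniform(low, d1, a1)                     # below the 1st decile
    data[a1:b1 + 1] = d1                                     # pinned 1st decile
    data[b1 + 1:a5] = rng.uniform(d1, median, a5 - b1 - 1)   # 1st decile to median
    data[a5:b5 + 1] = median                                 # pinned median
    data[b5 + 1:a9] = rng.uniform(median, d9, a9 - b5 - 1)   # median to 9th decile
    data[a9:b9 + 1] = d9                                     # pinned 9th decile

    # Every value above the 9th decile is the same X, chosen so that
    # n_top * X + sum(lower values) = n * mean.
    n_top = n - (b9 + 1)
    x = (n * mean - data[:b9 + 1].sum()) / n_top
    if x < d9:
        raise ValueError("lower values sum too high; try a smaller `low`")
    data[b9 + 1:] = x
    return data

# `low=2000` is an arbitrary assumption, not a value from the question.
data = build_dataset(n=5388, mean=4173, median=4072, d1=2720, d9=5676, low=2000)
print(data.mean(), np.quantile(data, [0.1, 0.5, 0.9]))
```

The array is not returned in sorted order; that does not matter for the statistics, and it can be shuffled or sorted afterwards as needed.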

Answered By: Him

Interesting question!
Based on Scott’s suggestions I gave it a quick try.

Inputs:

import random
import pandas as pd
import numpy as np

# fixing the random seed
random.seed(a=1, version=2)
# formatting floats
pd.options.display.float_format = '{:.1f}'.format

# given inputs
count = 5388
mean = 4173
median = 4072

lower_percentile = 10
lower_percentile_value = 2720

upper_percentile = 90
upper_percentile_value = 5676

max_value = 6325
min_value = 2101

The Function:

def generate_dataset(count, mean, median, lower_percentile, upper_percentile,
                     lower_percentile_value, upper_percentile_value,
                     min_value, max_value):
    # Number of values in each of the four segments: below the lower
    # percentile, lower percentile to median, median to upper percentile,
    # and above the upper percentile.
    p_1_size = int(lower_percentile * count / 100)
    p_4_size = int(count - (upper_percentile * count / 100))
    p_2_size = int((count / 2) - p_1_size)
    p_3_size = int((count / 2) - p_4_size)

    # Hand-tuned upper bound for the third segment; change it to adjust the mean.
    # It is larger than upper_percentile_value, so some third-segment values
    # land above the target and the measured upper percentile overshoots it.
    mean_adjuster = 5790

    # Randomly pick the right number of integers from each range.
    # Note: range() excludes its upper bound, and `mean` is not used directly.
    p_1 = random.choices(range(min_value, lower_percentile_value), k=p_1_size)
    p_2 = random.choices(range(lower_percentile_value, median), k=p_2_size)
    p_3 = random.choices(range(median, mean_adjuster), k=p_3_size)
    p_4 = random.choices(range(upper_percentile_value, max_value), k=p_4_size)

    return p_1 + p_2 + p_3 + p_4

dataset = generate_dataset(
    count, mean, median, lower_percentile, upper_percentile,
    lower_percentile_value, upper_percentile_value, min_value, max_value
)

Comparison:

# converting into DataFrame
df = pd.DataFrame({"x": dataset})

new_count = len(df)
new_mean = np.mean(df.x)
new_median = np.quantile(df.x, 0.5)
new_lower_percentile = np.quantile(df.x, lower_percentile/100)
new_upper_percentile = np.quantile(df.x, upper_percentile/100)

compare = pd.DataFrame(
    {
        "value": ["count", "mean", "median", "low_p", "high_p"],
        "original": [count, mean, median, lower_percentile_value, upper_percentile_value],
        "new":[new_count, new_mean, new_median, new_lower_percentile, new_upper_percentile]
    }
)

print(compare)

Output:

   value  original    new
0   count      5388 5388.0
1    mean      4173 4173.4
2  median      4072 4072.5
3   low_p      2720 2720.4
4  high_p      5676 5743.0

Getting the values to be exactly equal is a bit tricky when all your values are integers rather than floats.

You can add another variable to control the mean with two numbers, or change the random seed and see if you get closer values. Alternatively, you can write a function that changes the seed until the values are equal (might take a couple of minutes or a couple of centuries :).
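
One possible way to hit the mean exactly with integers, instead of searching over seeds, is to leave everything up to the upper percentile alone and spread the missing total over the largest values. The sketch below is an assumption-laden illustration, not part of the code above: the helper name shift_top_to_mean is made up, the input is assumed to be sorted, and count * mean is assumed to be an integer. If the shift turns out negative, the caller still has to check that the top values stay above the upper percentile.

```python
import numpy as np

def shift_top_to_mean(sorted_values, target_mean, n_top):
    """Add integer offsets to the largest n_top values so the mean is exact."""
    data = np.array(sorted_values, dtype=np.int64)
    n = data.size
    # Total amount that must be added to the dataset to reach the target mean.
    deficit = int(round(target_mean * n)) - int(data.sum())
    # Give every top value the same base offset, then hand the remainder
    # out one unit at a time to the very largest values (keeps sort order).
    base, extra = divmod(deficit, n_top)
    data[n - n_top:] += base
    if extra:
        data[n - extra:] += 1
    return data

# Toy example: only the two largest values change, so the median is untouched.
toy = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
fixed = shift_top_to_mean(toy, target_mean=6, n_top=2)
print(fixed, fixed.mean(), np.median(fixed))
```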

Cheers!

Answered By: Geom

General comment:

If you have specified a quantile function Q(p), then drawing U from a uniform distribution on [0, 1] and evaluating Q(U) gives a draw from the desired distribution.
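
As a rough sketch of this idea, one could take Q to be piecewise linear through the known points (the two deciles and the median) and draw from it with np.interp. The end points Q(0) and Q(1) are not given in the question, so q_min = 2000 below is an arbitrary assumption, and q_max is then solved so that the mean of this particular Q equals 4173. The sample statistics will only match approximately, since the draws are random.

```python
import numpy as np

target_mean = 4173
d1, median, d9 = 2720, 4072, 5676
q_min = 2000.0  # assumed minimum, not given in the question

# The mean of a piecewise-linear Q is its trapezoid-rule integral over [0, 1]:
#   0.05*(q_min + d1) + 0.2*(d1 + median) + 0.2*(median + d9) + 0.05*(d9 + q_max)
# Solve that expression for q_max so the distribution mean hits the target.
known = 0.05 * (q_min + d1) + 0.2 * (d1 + median) + 0.2 * (median + d9) + 0.05 * d9
q_max = (target_mean - known) / 0.05

probs = np.array([0.0, 0.1, 0.5, 0.9, 1.0])
values = np.array([q_min, d1, median, d9, q_max])

# Inverse transform sampling: U ~ Uniform(0, 1), then evaluate Q(U).
rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, size=5388)
sample = np.interp(u, probs, values)

print(q_max)
print(sample.mean(), np.quantile(sample, [0.1, 0.5, 0.9]))
```

Any other monotone Q through the same three points would work as well; the piecewise-linear choice just keeps the mean easy to compute by hand.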

Answered By: P.Jo