Distribution of elements according to percentage frequency

Question:

Is there a function in pandas, NumPy or plain Python that can generate a frequency distribution from percentage values, like we can do with EnumeratedDistribution in Java?

Input:

values = [0, 1, 2]

percentage = [0.5, 0.30, 0.20]

total = 10

Output:

[0, 0, 0, 0, 0, 1, 1, 1, 2, 2]

Out of the 10 elements in total, 50% are 0, 30% are 1 and 20% are 2.

Asked By: Amitabh Kumar


Answers:

You can use NumPy's repeat() function to repeat each element of values a specified number of times (percentage * total):

import numpy as np


values = [0, 1, 2]

percentage = [0.5, 0.30, 0.20]

total = 11

repeats = np.around(np.array(percentage) * total).astype(int)  # [6, 3, 2]; int rather than np.int8, which overflows above 127

np.repeat(values, repeats)

Output:

array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2])

I used np.around() to round the repeat counts in case they are not whole numbers (e.g. if total is 11, then 11*0.5 = 5.5 -> 6, 11*0.3 = 3.3 -> 3 and 11*0.2 = 2.2 -> 2).
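One detail worth knowing about the rounding step: np.around() rounds exact halves to the nearest even integer, so 5.5 becomes 6 but 4.5 becomes 4. The sketch below illustrates this with the same percentages; the helper name repeat_counts is hypothetical and not part of NumPy.

```python
import numpy as np

def repeat_counts(percentage, total):
    """Hypothetical helper: rounded repeat count for each share."""
    return np.around(np.array(percentage) * total).astype(int).tolist()

percentage = [0.5, 0.30, 0.20]

# 11 * 0.5 = 5.5 rounds to the even neighbour 6
print(repeat_counts(percentage, 11))  # [6, 3, 2]

# 9 * 0.5 = 4.5 rounds to the even neighbour 4, not 5
print(repeat_counts(percentage, 9))   # [4, 3, 2]
```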

Answered By: Andreas K.

Without using numpy, but only list-comprehension:

values = [0, 1, 2]
percentage = [0.5, 0.30, 0.20]
total = 10

output = sum([[e] * round(total * p) for e, p in zip(values, percentage)], [])  # round() rather than int(), which would truncate e.g. 2.9999 to 2
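As a possible variant of the same idea, the sketch below flattens the repeated values with itertools.chain.from_iterable instead of sum(..., []), which avoids copying the growing list at every step. It also uses round() rather than int(), so a product that lands slightly below a whole number is not truncated. The function name expand_by_share is an assumption made for illustration.

```python
from itertools import chain

def expand_by_share(values, percentage, total):
    """Repeat each value round(total * share) times and flatten the result."""
    return list(chain.from_iterable(
        [value] * round(total * share)
        for value, share in zip(values, percentage)
    ))

print(expand_by_share([0, 1, 2], [0.5, 0.30, 0.20], 10))
# [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]
```

Like the one-liner, this does not guarantee that the result has exactly total elements when the products are not whole numbers.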
Answered By: FBruzzesi

@Andreas K's solution is great, but the size of its result does not always equal the original total. E.g. [27.3, 36.4, 27.3] sums to 91, but after rounding it becomes [27, 36, 27], which sums to 90.
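The mismatch can be reproduced with a few lines. This is only a check of the numbers quoted above; the variable names are chosen for illustration.

```python
import numpy as np

counts = np.array([27.3, 36.4, 27.3])    # unrounded repeat counts
target = int(np.round(counts.sum()))     # total we want to keep
rounded = np.around(counts).astype(int)  # element-wise rounding

print(target)                # 91
print(rounded.tolist())      # [27, 36, 27]
print(int(rounded.sum()))    # 90, one element short of the target
```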

I prefer the following way of rounding, which preserves the sum. It is adapted slightly from this post: https://stackoverflow.com/a/74044227/3789481

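import numpy as np

def round_retain_sum(x: np.ndarray) -> np.ndarray:
    # Assumes non-negative entries, as with percentage * total.
    N = np.round(np.sum(x)).astype(int)  # target total after rounding
    y = x.astype(int)                    # truncate every entry
    M = np.sum(y)
    K = N - M                            # number of units still to hand out
    z = y - x                            # most negative = largest fractional part
    if K > 0:
        # K - 1 (not K) keeps kth in bounds when K == len(x)
        idx = np.argpartition(z, K - 1)[:K]
        y[idx] += 1
    return y

values = [0, 1, 2]
percentage = [0.5, 0.30, 0.20]
total = 11
repeats = round_retain_sum(np.array(percentage) * total)  # [6, 3, 2]
np.repeat(values, repeats)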
Answered By: Tấn Nguyên