Distribution of elements according to percentage frequency
Question:
Is there a function in pandas, NumPy, or plain Python that can generate a frequency distribution from percentage values, similar to what EnumeratedDistribution does in Java?
Input:
values = [0, 1, 2]
percentage = [0.5, 0.30, 0.20]
total = 10
Output:
[0, 0, 0, 0, 0, 1, 1, 1, 2, 2]
Out of 10 total elements, 50% are 0, 30% are 1, and 20% are 2.
Answers:
You can use numpy’s repeat() function to repeat each value in values a specified number of times (percentage * total):
import numpy as np
values = [0, 1, 2]
percentage = [0.5, 0.30, 0.20]
total = 11
repeats = np.around(np.array(percentage) * total).astype(int) # [6, 3, 2]; int rather than np.int8, which overflows for counts above 127
np.repeat(values, repeats)
Output:
array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2])
I used the np.around() function to round the repeats in case they are not whole numbers (e.g. if total is 11, then 11*0.5 -> 6, 11*0.3 -> 3, and 11*0.2 -> 2).
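One detail of the rounding step is worth a quick check: np.around() rounds exact halves to the nearest even integer rather than always up, so 5.5 becomes 6 but 2.5 becomes 2. A minimal sketch of that behaviour, reusing the percentages from the answer (the total of 5 is an extra illustrative case, not from the original post):

```python
import numpy as np

percentage = np.array([0.5, 0.30, 0.20])

# Exact halves go to the nearest even integer: 0.5 -> 0, 1.5 -> 2, 2.5 -> 2.
halves = np.around(np.array([0.5, 1.5, 2.5]))

# total = 11 gives roughly 5.5, 3.3, 2.2 before rounding.
repeats_11 = np.around(percentage * 11).astype(int)

# total = 5 gives 2.5, 1.5, 1.0 before rounding (illustrative case).
repeats_5 = np.around(percentage * 5).astype(int)

print(halves, repeats_11, repeats_5)
```

In both cases the counts happen to add up to total, but that is not guaranteed in general.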
Without numpy, using only a list comprehension:
values = [0, 1, 2]
percentage = [0.5, 0.30, 0.20]
total = 10
output = sum([[e] * round(total * p) for e, p in zip(values, percentage)], [])  # round() instead of int(): int(100 * 0.29) == 28 because of float error
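Converting each count to an integer independently can leave the list shorter or longer than total when total * p is not a whole number. A minimal pure-Python sketch that keeps the length equal to total by handing the leftover slots to the entries with the largest fractional parts (the largest-remainder method). The helper name distribute is hypothetical, and the sketch assumes the percentages sum to 1:

```python
import math

def distribute(values, percentage, total):
    """Repeat each value about total * p times so that len(result) == total."""
    exact = [total * p for p in percentage]
    counts = [math.floor(x) for x in exact]
    leftover = total - sum(counts)
    # Indices ordered by fractional part, largest first.
    by_remainder = sorted(range(len(exact)),
                          key=lambda i: exact[i] - counts[i],
                          reverse=True)
    # Give one extra slot to each of the `leftover` largest remainders.
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return [v for v, c in zip(values, counts) for _ in range(c)]

output = distribute([0, 1, 2], [0.5, 0.30, 0.20], 11)
print(output)
```

Sorting the remainders costs O(n log n), which is negligible for a handful of categories.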
@Andreas K’s solution is great, but the size of the result does not always equal the original total. E.g. [27.3, 36.4, 27.3] sums to 91, but after rounding it becomes [27, 36, 27], which sums to 90.
I prefer the following way of rounding, slightly adapted from this post: https://stackoverflow.com/a/74044227/3789481
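import numpy as np

def round_retain_sum(x: np.ndarray) -> np.ndarray:
    # Round x to integers while keeping the (rounded) sum of x unchanged.
    N = np.round(np.sum(x)).astype(int)  # target sum
    y = x.astype(int)                    # truncate each entry
    M = np.sum(y)
    K = N - M                            # units still to hand out
    z = y - x                            # most negative = largest fractional part
    if K != 0:
        # K - 1 keeps the partition index in bounds even when K == len(x)
        idx = np.argpartition(z, K - 1)[:K]
        y[idx] += 1
    return y

values = [0, 1, 2]
percentage = [0.5, 0.30, 0.20]
total = 11
repeats = round_retain_sum(np.array(percentage) * total)  # [6, 3, 2]
np.repeat(values, repeats)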
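To sanity-check any of these approaches, the counts in the result can be compared back to the requested percentages. A small sketch using np.unique with return_counts=True; the repeats array is hard-coded here to the [6, 3, 2] from the total = 11 example, purely for illustration:

```python
import numpy as np

values = [0, 1, 2]
repeats = np.array([6, 3, 2])  # counts from the total = 11 example
result = np.repeat(values, repeats)

# Count how often each value occurs and convert to achieved fractions.
uniq, counts = np.unique(result, return_counts=True)
shares = counts / counts.sum()

print(len(result))
print(dict(zip(uniq.tolist(), shares.round(3).tolist())))
```

The achieved fractions will only approximate the requested ones whenever total * percentage is not a whole number.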