Cryptographically-secure, exactly-weighted sampling
Question:
How do I choose k elements with replacement and weights under the following conditions?
- Randomness must be cryptographically-secure, e.g. as used in the secrets module.
- Weighting must be exact, i.e. use integral instead of floating-point arithmetic.
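To illustrate why the second condition matters, here is a minimal sketch. The weights are hypothetical and chosen only to expose rounding: once a total exceeds 2**53, a float can no longer represent every integer, so a small weight can silently disappear.

```python
# Hypothetical weights, picked only to expose float rounding.
weights = [2**53, 1]

# Integer arithmetic keeps the total exact.
int_total = sum(weights)

# Float arithmetic rounds: 2**53 + 1 is not representable as a double.
float_total = float(weights[0]) + float(weights[1])

print(int_total == 2**53 + 1)        # True
print(float_total == float(2**53))   # True: the weight of 1 has vanished
```

With float weights, the second element here could never be selected in proportion to its weight, which is the kind of inexactness the condition rules out.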
Self-authored code is likely to be less secure and less efficient than available implementations. To the best of my understanding, the following implementations don't meet my requirements.
Answers:
I would just rip apart the choices implementation from the random module. Something like:
from random import SystemRandom
from itertools import accumulate as _accumulate, repeat as _repeat
from bisect import bisect as _bisect

def choices(population, weights, *, k=1):
    # SystemRandom draws from os.urandom, the same source the secrets module uses
    randrange = SystemRandom().randrange
    n = len(population)
    cum_weights = list(_accumulate(weights))
    if len(cum_weights) != n:
        raise ValueError('The number of weights does not match the population')
    total = cum_weights[-1]
    if not isinstance(total, int):
        raise ValueError('Weights must be integer values')
    if total <= 0:
        raise ValueError('Total of weights must be greater than zero')
    bisect = _bisect
    hi = n - 1
    # randrange(total) is a uniform integer in [0, total); bisect maps it to an index
    return [population[bisect(cum_weights, randrange(total), 0, hi)]
            for i in _repeat(None, k)]
which could be tested as:
from collections import Counter
draws = choices([1, 2, 3], [1, 2, 3], k=1_000_000)
print(dict(sorted(Counter(draws).items())))
giving me:
{1: 166150, 2: 333614, 3: 500236}
which looks about right.
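To put a rough number on "about right", one option is to compare each count with its binomial expectation. This is only a sanity-check sketch using the counts printed above; the float arithmetic here is for the check only and plays no part in the sampling.

```python
from math import sqrt

# Counts reported in the run above, with the weights and k used there.
observed = {1: 166150, 2: 333614, 3: 500236}
weights = {1: 1, 2: 2, 3: 3}
k = 1_000_000
total = sum(weights.values())

z_scores = {}
for item, count in observed.items():
    p = weights[item] / total            # probability of drawing this item
    expected = k * p                     # expected count over k draws
    std = sqrt(k * p * (1 - p))          # binomial standard deviation
    z_scores[item] = (count - expected) / std

print({item: round(z, 2) for item, z in z_scores.items()})
```

Every count lands within about one and a half standard deviations of its expected value, which is consistent with correct weighting.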
Update: just thought to check for off-by-one errors and it seems good here:
print(
    choices([1, 2, 3], [1, 0, 0], k=5),
    choices([1, 2, 3], [0, 1, 0], k=5),
    choices([1, 2, 3], [0, 0, 1], k=5),
)
giving:
[1, 1, 1, 1, 1] [2, 2, 2, 2, 2] [3, 3, 3, 3, 3]
which also seems right.
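For a deterministic check that does not rely on sampling, a sketch like the following enumerates every value randrange(total) could return and counts which index the same bisect call selects. The helper name exact_counts and the extra weight list [5, 0, 7, 1] are my own for illustration; if the mapping is exact, each index is hit exactly as many times as its weight.

```python
from bisect import bisect
from collections import Counter
from itertools import accumulate

def exact_counts(weights):
    """Count how many of the `total` equally likely draws map to each index."""
    cum_weights = list(accumulate(weights))
    total = cum_weights[-1]
    hi = len(weights) - 1
    # Enumerate every integer that randrange(total) could produce.
    return Counter(bisect(cum_weights, r, 0, hi) for r in range(total))

for weights in ([1, 2, 3], [1, 0, 0], [0, 1, 0], [0, 0, 1], [5, 0, 7, 1]):
    counts = exact_counts(weights)
    # Each index must be selected exactly `weight` times out of `total`.
    assert [counts[i] for i in range(len(weights))] == weights
```

Because each of the total outcomes is equally likely, matching counts mean the selection probabilities are exactly weight / total, including for zero weights.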