Sampling Without Replacement Probabilities

Question:

I am using np.random.choice to do sampling without replacement.

I would like the following code to choose 0 50% of the time, 1 30% of the time, and 2 20% of the time.

import numpy as np

draws = []
for _ in range(10000):
    draw = np.random.choice(3, size=2, replace=False, p=[0.5, 0.3, 0.2])
    draws.append(draw)

result = np.r_[draws]

How can I correctly choose the parameters for np.random.choice to give me the result that I want?

The numbers I want represent the probability of the events being drawn in either 1st or 2nd position exclusively.

print(np.any(result==0, axis=1).mean()) # 0.83, want 0.8
print(np.any(result==1, axis=1).mean()) # 0.68, want 0.7
print(np.any(result==2, axis=1).mean()) # 0.47, want 0.5
Asked By: Ginger

||

Answers:

I’m giving two interpretations of the problem. One I prefer (“Timeless”) and one I consider technically valid but inferior (“Naive”)

Timeless:

Given probabilities x, y, z this approach computes x', y', z' such that if we draw twice independently and discard all equal pairs the frequencies of 0, 1, 2 are x, y, z.

This gives the right total frequencies over both trials and has the added benefit of being simple and being timeless in the sense that first and second trial are equivalent.

For this to hold we must have

(x'y' + x'z') / [2 (x'y' + x'z' + y'z')] = x
(x'y' + y'z') / [2 (x'y' + x'z' + y'z')] = y                         (1)
(y'z' + x'z') / [2 (x'y' + x'z' + y'z')] = z

If we add two of those and subtract the third we get

x'y' / (x'y' + x'z' + y'z') =  x + y - z = 1 - 2 z
x'z' / (x'y' + x'z' + y'z') =  x - y + z = 1 - 2 y                   (2)
y'z' / (x'y' + x'z' + y'z') = -x + y + z = 1 - 2 x

Multiplying 2 of those and dividing by the third

x'^2 / (x'y' + x'z' + y'z') = (1 - 2 z) (1 - 2 y) / (1 - 2 x)
y'^2 / (x'y' + x'z' + y'z') = (1 - 2 z) (1 - 2 x) / (1 - 2 y)        (3)
z'^2 / (x'y' + x'z' + y'z') = (1 - 2 x) (1 - 2 y) / (1 - 2 z)

Therefore up to a constant factor

x' ~ sqrt[(1 - 2 z) (1 - 2 y) / (1 - 2 x)]
y' ~ sqrt[(1 - 2 z) (1 - 2 x) / (1 - 2 y)]                           (4)
z' ~ sqrt[(1 - 2 x) (1 - 2 y) / (1 - 2 z)]

Since we know that x', y', z' must sum to one this is enough to solve.

But: we needn’t actually completely solve for x', y', z'. Since we are only interested in unequal pairs, all we need are the conditional probabilities x'y' / (x'y' + x'z' + y'z'), x'z' / (x'y' + x'z' + y'z') and y'z' / (x'y' + x'z' + y'z'). These we can compute using equation (2).

We then halve each of them to get the probabilities for ordered pairs and draw from the six legal pairs with these probabilities.

Naive:

This is based on the (arbitrary in my opinion) postulate that after the first draw with probability x', y', z', the second must have conditional probability 0, y' / (y'+z'), z' / (y'+z') if first was 0 x' / (x'+z'), 0, z' / (x'+z') if first was 1 and probability x' / (x'+y'), y' / (x'+y'), 0) if first was 2.

This has the disadvantage that as far as I can tell there is no simple, closed-form solution and the second and first draws are quite different.

The advantage is that one can use it directly with np.random.choice; this is, however, so slow that in the implementation below I give a workaround that avoids this function.

After some algebra one finds:

1/x' - x' = c (1 - 2x)
1/y' - y' = c (1 - 2y)
1/z' - z' = c (1 - 2z)

where c = 1/x' + 1/y' + 1/z' - 1. This I only managed to solve numerically.

Implementation and results:

And here is the implementation.

import numpy as np
from scipy import optimize

def f_pairs(n, p):
    p = np.asanyarray(p)
    p /= p.sum()
    assert np.all(p <= 0.5)
    pp = 1 - 2*p

    # the following two lines show how to compute x', y', z'
    # pp = np.sqrt(pp.prod()) / pp
    # pp /= pp.sum()
    # now pp contains x', y', z'

    i, j = np.triu_indices(3, 1)
    i, j = i[::-1], j[::-1]
    pairs = np.c_[np.r_[i, j], np.r_[j, i]]
    pp6 = np.r_[pp/2, pp/2]
    return pairs[np.random.choice(6, size=(n,), replace=True, p=pp6)]

def f_opt(n, p):
    p = np.asanyarray(p)
    p /= p.sum()
    pp = 1 - 2*p
    def target(l):
        lp2 = l*pp/2
        return (np.sqrt(1 + lp2**2) - lp2).sum() - 1
    l = optimize.root(target, 8).x
    lp2 = l*pp/2
    pp = np.sqrt(1 + lp2**2) - lp2
    fst = np.random.choice(3, size=(n,), replace=True, p=pp)
    snd = (
        (np.random.random((n,)) < (1 / (1 + (pp[(fst+1)%3] / pp[(fst-1)%3]))))
        + fst + 1) % 3
    return np.c_[fst, snd]

def f_naive(n, p):
    p = np.asanyarray(p)
    p /= p.sum()
    pp = 1 - 2*p
    def target(l):
        lp2 = l*pp/2
        return (np.sqrt(1 + lp2**2) - lp2).sum() - 1
    l = optimize.root(target, 8).x
    lp2 = l*pp/2
    pp = np.sqrt(1 + lp2**2) - lp2
    return np.array([np.random.choice(3, (2,), replace=False, p=pp)
                    for _ in range(n)])

def check_sol(p, sol):
    N = len(sol)
    print("Frequencies [value: observed, desired]")
    c1 = np.bincount(sol[:, 0], minlength=3) / N
    print(f"1st column:  0: {c1[0]:8.6f} {p[0]:8.6f}  1: {c1[1]:8.6f} {p[1]:8.6f}  2: {c1[2]:8.6f} {p[2]:8.6f}")
    c2 = np.bincount(sol[:, 1], minlength=3) / N
    print(f"2nd column:  0: {c2[0]:8.6f} {p[0]:8.6f}  1: {c2[1]:8.6f} {p[1]:8.6f}  2: {c2[2]:8.6f} {p[2]:8.6f}")
    c = c1 + c2
    print(f"1st or 2nd:  0: {c[0]:8.6f} {2*p[0]:8.6f}  1: {c[1]:8.6f} {2*p[1]:8.6f}  2: {c[2]:8.6f} {2*p[2]:8.6f}")
    print()
    print("2nd column conditioned on 1st column [value 1st: val / prob 2nd]")
    for i in range(3):
        idx = np.flatnonzero(sol[:, 0]==i)
        c = np.bincount(sol[idx, 1], minlength=3) / len(idx)
        print(f"{i}: 0 / {c[0]:8.6f} 1 / {c[1]:8.6f} 2 / {c[2]:8.6f}")
    print()

# demo
p = 0.4, 0.35, 0.25
n = 1000000
print("Method: Naive")
check_sol(p, f_naive(n//10, p))
print("Method: naive, optimized")
check_sol(p, f_opt(n, p))
print("Method: Timeless")
check_sol(p, f_pairs(n, p))

Sample output:

Method: Naive
Frequencies [value: observed, desired]
1st column:  0: 0.449330 0.400000  1: 0.334180 0.350000  2: 0.216490 0.250000
2nd column:  0: 0.349050 0.400000  1: 0.366640 0.350000  2: 0.284310 0.250000
1st or 2nd:  0: 0.798380 0.800000  1: 0.700820 0.700000  2: 0.500800 0.500000

2nd column conditioned on 1st column [value 1st: val / prob 2nd]
0: 0 / 0.000000 1 / 0.608128 2 / 0.391872
1: 0 / 0.676133 1 / 0.000000 2 / 0.323867
2: 0 / 0.568617 1 / 0.431383 2 / 0.000000

Method: naive, optimized
Frequencies [value: observed, desired]
1st column:  0: 0.450606 0.400000  1: 0.334881 0.350000  2: 0.214513 0.250000
2nd column:  0: 0.349624 0.400000  1: 0.365469 0.350000  2: 0.284907 0.250000
1st or 2nd:  0: 0.800230 0.800000  1: 0.700350 0.700000  2: 0.499420 0.500000

2nd column conditioned on 1st column [value 1st: val / prob 2nd]
0: 0 / 0.000000 1 / 0.608132 2 / 0.391868
1: 0 / 0.676515 1 / 0.000000 2 / 0.323485
2: 0 / 0.573727 1 / 0.426273 2 / 0.000000

Method: Timeless
Frequencies [value: observed, desired]
1st column:  0: 0.400756 0.400000  1: 0.349099 0.350000  2: 0.250145 0.250000
2nd column:  0: 0.399128 0.400000  1: 0.351298 0.350000  2: 0.249574 0.250000
1st or 2nd:  0: 0.799884 0.800000  1: 0.700397 0.700000  2: 0.499719 0.500000

2nd column conditioned on 1st column [value 1st: val / prob 2nd]
0: 0 / 0.000000 1 / 0.625747 2 / 0.374253
1: 0 / 0.714723 1 / 0.000000 2 / 0.285277
2: 0 / 0.598129 1 / 0.401871 2 / 0.000000
Answered By: Paul Panzer
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.