Sampling Without Replacement Probabilities Sampling Without Replacement Probabilities numpy numpy

Sampling Without Replacement Probabilities


I'm giving two interpretations of the problem. One I prefer ("Timeless") and one I consider technically valid but inferior ("Naive")

Timeless:

Given probabilities x, y, z this approach computes x', y', z' such that if we draw twice independently and discard all equal pairs the frequencies of 0, 1, 2 are x, y, z.

This gives the right total frequencies over both trials and has the added benefit of being simple and being timeless in the sense that first and second trial are equivalent.

For this to hold we must have

(x'y' + x'z') / [2 (x'y' + x'z' + y'z')] = x(x'y' + y'z') / [2 (x'y' + x'z' + y'z')] = y                         (1)(y'z' + x'z') / [2 (x'y' + x'z' + y'z')] = z

If we add two of those and subtract the third we get

x'y' / (x'y' + x'z' + y'z') =  x + y - z = 1 - 2 zx'z' / (x'y' + x'z' + y'z') =  x - y + z = 1 - 2 y                   (2)y'z' / (x'y' + x'z' + y'z') = -x + y + z = 1 - 2 x

Multiplying 2 of those and dividing by the third

x'^2 / (x'y' + x'z' + y'z') = (1 - 2 z) (1 - 2 y) / (1 - 2 x)y'^2 / (x'y' + x'z' + y'z') = (1 - 2 z) (1 - 2 x) / (1 - 2 y)        (3)z'^2 / (x'y' + x'z' + y'z') = (1 - 2 x) (1 - 2 y) / (1 - 2 z)

Therefore up to a constant factor

x' ~ sqrt[(1 - 2 z) (1 - 2 y) / (1 - 2 x)]y' ~ sqrt[(1 - 2 z) (1 - 2 x) / (1 - 2 y)]                           (4)z' ~ sqrt[(1 - 2 x) (1 - 2 y) / (1 - 2 z)]

Since we know that x', y', z' must sum to one this is enough to solve.

But: we needn't actually completely solve for x', y', z'. Since we are only interested in unequal pairs, all we need are the conditional probabilities x'y' / (x'y' + x'z' + y'z'), x'z' / (x'y' + x'z' + y'z') and y'z' / (x'y' + x'z' + y'z'). These we can compute using equation (2).

We then halve each of them to get the probabilities for ordered pairs and draw from the six legal pairs with these probabilities.

Naive:

This is based on the (arbitrary in my opinion) postulate that after the first draw with probability x', y', z', the second must have conditional probability 0, y' / (y'+z'), z' / (y'+z') if first was 0 x' / (x'+z'), 0, z' / (x'+z') if first was 1 and probability x' / (x'+y'), y' / (x'+y'), 0) if first was 2.

This has the disadvantage that as far as I can tell there is no simple, closed-form solution and the second and first draws are quite different.

The advantage is that one can use it directly with np.random.choice; this is, however, so slow that in the implementation below I give a workaround that avoids this function.

After some algebra one finds:

1/x' - x' = c (1 - 2x)1/y' - y' = c (1 - 2y)1/z' - z' = c (1 - 2z)

where c = 1/x' + 1/y' + 1/z' - 1. This I only managed to solve numerically.

Implementation and results:

And here is the implementation.

import numpy as npfrom scipy import optimizedef f_pairs(n, p):    p = np.asanyarray(p)    p /= p.sum()    assert np.all(p <= 0.5)    pp = 1 - 2*p    # the following two lines show how to compute x', y', z'    # pp = np.sqrt(pp.prod()) / pp    # pp /= pp.sum()    # now pp contains x', y', z'    i, j = np.triu_indices(3, 1)    i, j = i[::-1], j[::-1]    pairs = np.c_[np.r_[i, j], np.r_[j, i]]    pp6 = np.r_[pp/2, pp/2]    return pairs[np.random.choice(6, size=(n,), replace=True, p=pp6)]def f_opt(n, p):    p = np.asanyarray(p)    p /= p.sum()    pp = 1 - 2*p    def target(l):        lp2 = l*pp/2        return (np.sqrt(1 + lp2**2) - lp2).sum() - 1    l = optimize.root(target, 8).x    lp2 = l*pp/2    pp = np.sqrt(1 + lp2**2) - lp2    fst = np.random.choice(3, size=(n,), replace=True, p=pp)    snd = (        (np.random.random((n,)) < (1 / (1 + (pp[(fst+1)%3] / pp[(fst-1)%3]))))        + fst + 1) % 3    return np.c_[fst, snd]def f_naive(n, p):    p = np.asanyarray(p)    p /= p.sum()    pp = 1 - 2*p    def target(l):        lp2 = l*pp/2        return (np.sqrt(1 + lp2**2) - lp2).sum() - 1    l = optimize.root(target, 8).x    lp2 = l*pp/2    pp = np.sqrt(1 + lp2**2) - lp2    return np.array([np.random.choice(3, (2,), replace=False, p=pp)                    for _ in range(n)])def check_sol(p, sol):    N = len(sol)    print("Frequencies [value: observed, desired]")    c1 = np.bincount(sol[:, 0], minlength=3) / N    print(f"1st column:  0: {c1[0]:8.6f} {p[0]:8.6f}  1: {c1[1]:8.6f} {p[1]:8.6f}  2: {c1[2]:8.6f} {p[2]:8.6f}")    c2 = np.bincount(sol[:, 1], minlength=3) / N    print(f"2nd column:  0: {c2[0]:8.6f} {p[0]:8.6f}  1: {c2[1]:8.6f} {p[1]:8.6f}  2: {c2[2]:8.6f} {p[2]:8.6f}")    c = c1 + c2    print(f"1st or 2nd:  0: {c[0]:8.6f} {2*p[0]:8.6f}  1: {c[1]:8.6f} {2*p[1]:8.6f}  2: {c[2]:8.6f} {2*p[2]:8.6f}")    print()    print("2nd column conditioned on 1st column [value 1st: val / prob 2nd]")    for i in range(3):        idx = np.flatnonzero(sol[:, 0]==i)        c = np.bincount(sol[idx, 1], minlength=3) / len(idx)        print(f"{i}: 0 / {c[0]:8.6f} 1 / {c[1]:8.6f} 2 / {c[2]:8.6f}")    print()# demop = 0.4, 0.35, 0.25n = 1000000print("Method: Naive")check_sol(p, f_naive(n//10, p))print("Method: naive, optimized")check_sol(p, f_opt(n, p))print("Method: Timeless")check_sol(p, f_pairs(n, p))

Sample output:

Method: NaiveFrequencies [value: observed, desired]1st column:  0: 0.449330 0.400000  1: 0.334180 0.350000  2: 0.216490 0.2500002nd column:  0: 0.349050 0.400000  1: 0.366640 0.350000  2: 0.284310 0.2500001st or 2nd:  0: 0.798380 0.800000  1: 0.700820 0.700000  2: 0.500800 0.5000002nd column conditioned on 1st column [value 1st: val / prob 2nd]0: 0 / 0.000000 1 / 0.608128 2 / 0.3918721: 0 / 0.676133 1 / 0.000000 2 / 0.3238672: 0 / 0.568617 1 / 0.431383 2 / 0.000000Method: naive, optimizedFrequencies [value: observed, desired]1st column:  0: 0.450606 0.400000  1: 0.334881 0.350000  2: 0.214513 0.2500002nd column:  0: 0.349624 0.400000  1: 0.365469 0.350000  2: 0.284907 0.2500001st or 2nd:  0: 0.800230 0.800000  1: 0.700350 0.700000  2: 0.499420 0.5000002nd column conditioned on 1st column [value 1st: val / prob 2nd]0: 0 / 0.000000 1 / 0.608132 2 / 0.3918681: 0 / 0.676515 1 / 0.000000 2 / 0.3234852: 0 / 0.573727 1 / 0.426273 2 / 0.000000Method: TimelessFrequencies [value: observed, desired]1st column:  0: 0.400756 0.400000  1: 0.349099 0.350000  2: 0.250145 0.2500002nd column:  0: 0.399128 0.400000  1: 0.351298 0.350000  2: 0.249574 0.2500001st or 2nd:  0: 0.799884 0.800000  1: 0.700397 0.700000  2: 0.499719 0.5000002nd column conditioned on 1st column [value 1st: val / prob 2nd]0: 0 / 0.000000 1 / 0.625747 2 / 0.3742531: 0 / 0.714723 1 / 0.000000 2 / 0.2852772: 0 / 0.598129 1 / 0.401871 2 / 0.000000