
How to incrementally sample without replacement?


If you know in advance that you're going to want multiple samples without overlaps, the easiest approach is to do random.shuffle() on list(range(100)) (Python 3 – you can skip the list() in Python 2), then peel off slices as needed.

    import random

    s = list(range(100))
    random.shuffle(s)
    first_sample = s[-10:]
    del s[-10:]
    second_sample = s[-10:]
    del s[-10:]
    # etc.
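If you want the slicing handled for you, the same idea fits naturally in a generator – a sketch with a hypothetical `shuffled_samples` helper (not part of the original answer):

```python
import random

def shuffled_samples(n, sample_size):
    """Yield successive non-overlapping samples of `sample_size`
    values from range(n), peeling slices off a single shuffle."""
    pool = list(range(n))
    random.shuffle(pool)
    while len(pool) >= sample_size:
        sample, pool = pool[-sample_size:], pool[:-sample_size]
        yield sample
```

Each yielded sample is disjoint from all previous ones, and the generator is exhausted once fewer than sample_size values remain.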

Otherwise, @Chronial's answer is reasonably efficient.


Note from the OP for readers: please consider looking at the originally accepted answer first to understand the logic, and then come back to this answer.

Aaaaaand for completeness’ sake: This is the concept of necromancer’s answer, but adapted so that it takes a list of forbidden numbers as input. This is the same code as in my previous answer, but we build a state from forbid before we generate numbers.

  • This is O(f+k) time and O(f+k) memory. Obviously this is the fastest thing possible without requirements on the format of forbid (sorted/set). I think this makes it a winner in some way ^^.
  • If forbid is a set, the repeated-guessing method is faster, with O(k⋅n/(n-(f+k))), which is very close to O(k) as long as f+k is not very close to n.
  • If forbid is sorted, my ridiculous algorithm is faster, with:
    O(k⋅(log(f+k)+log²(n/(n-(f+k)))))
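For reference, the repeated-guessing method mentioned in the second bullet can be sketched like this (a minimal version; the function name is mine):

```python
import random

def sample_by_guessing(n, k, forbid):
    """Draw k distinct values from range(n), avoiding `forbid`,
    by guessing and retrying on collisions."""
    taken = set(forbid)  # a set gives O(1) membership tests
    out = []
    while len(out) < k:
        v = random.randrange(n)
        if v not in taken:
            taken.add(v)
            out.append(v)
    return out
```

Each draw needs n/(n-(f+k)) guesses in expectation, which is why this stays near O(k) unless f+k approaches n.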
    import random

    def sample_gen(n, forbid):
        state = dict()
        track = dict()
        for (i, o) in enumerate(forbid):
            x = track.get(o, o)
            t = state.get(n - i - 1, n - i - 1)
            state[x] = t
            track[t] = x
            state.pop(n - i - 1, None)
            track.pop(o, None)
        del track
        for remaining in range(n - len(forbid), 0, -1):
            i = random.randrange(remaining)
            yield state.get(i, i)
            state[i] = state.get(remaining - 1, remaining - 1)
            state.pop(remaining - 1, None)

usage:

    gen = sample_gen(10, [1, 2, 4, 8])
    print(next(gen))
    print(next(gen))
    print(next(gen))
    print(next(gen))
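A quick way to convince yourself the generator is correct is to drain it completely: the output should be exactly the complement of forbid, in random order. (The generator is repeated here so the snippet runs standalone.)

```python
import random

def sample_gen(n, forbid):
    state = dict()
    track = dict()
    for (i, o) in enumerate(forbid):
        x = track.get(o, o)
        t = state.get(n - i - 1, n - i - 1)
        state[x] = t
        track[t] = x
        state.pop(n - i - 1, None)
        track.pop(o, None)
    del track
    for remaining in range(n - len(forbid), 0, -1):
        i = random.randrange(remaining)
        yield state.get(i, i)
        state[i] = state.get(remaining - 1, remaining - 1)
        state.pop(remaining - 1, None)

drawn = list(sample_gen(10, [1, 2, 4, 8]))
# every allowed number appears exactly once
assert sorted(drawn) == [0, 3, 5, 6, 7, 9]
```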


Ok, here we go. This should be the fastest possible non-probabilistic algorithm. It has a runtime of O(k⋅log²(s) + f⋅log(f)) ⊂ O(k⋅log²(f+k) + f⋅log(f)) and space O(k+f). Here f is the number of forbidden numbers and s is the length of the longest streak of forbidden numbers. The expectation for s is more complicated, but obviously bounded by f. If you assume that s^log₂(s) is bigger than f, or are just unhappy about the fact that s is once again probabilistic, you can change the log part to a bisection search in forbidden[pos:] to get O(k⋅log(f+k) + f⋅log(f)).

The actual implementation here is O(k⋅(k+f)+f⋅log(f)), as insertion into the list forbidden is O(n). This is easy to fix by replacing that list with a blist sortedlist.

I also added some comments, because this algorithm is ridiculously complex. The lin part does the same as the log part, but needs s instead of log²(s) time.

    import bisect
    import random

    def sample(k, end, forbid):
        forbidden = sorted(forbid)
        out = []
        # remove the last block from forbidden if it touches end
        for end in reversed(range(end + 1)):
            if len(forbidden) > 0 and forbidden[-1] == end:
                del forbidden[-1]
            else:
                break
        for i in range(k):
            v = random.randrange(end - len(forbidden) + 1)
            # increase v by the number of values < v
            pos = bisect.bisect(forbidden, v)
            v += pos
            # this number might also be already taken, find the
            # first free spot
            ##### linear
            #while pos < len(forbidden) and forbidden[pos] <= v:
            #    pos += 1
            #    v += 1
            ##### log
            while pos < len(forbidden) and forbidden[pos] <= v:
                step = 2
                # when this is finished, we know that:
                # • forbidden[pos + step/2] <= v + step/2
                # • forbidden[pos + step]   >  v + step
                # so repeat until (checked by outer loop):
                #   forbidden[pos + step/2] == v + step/2
                while (pos + step <= len(forbidden)) and \
                      (forbidden[pos + step - 1] <= v + step - 1):
                    step = step << 1
                pos += step >> 1
                v += step >> 1
            if v == end:
                end -= 1
            else:
                bisect.insort(forbidden, v)
            out.append(v)
        return out
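The rank-shifting at the heart of the algorithm – turning an index into the complement of a sorted forbidden list into an actual value – can also be illustrated on its own. A simplified sketch with a hypothetical `nth_allowed` helper (plain iteration instead of the galloping search, so worst-case linear in the streak length):

```python
import bisect

def nth_allowed(i, forbidden):
    """Return the i-th smallest non-negative integer not contained
    in the sorted list `forbidden`."""
    v = i
    while True:
        # every forbidden value <= v pushes the answer one slot up
        shift = bisect.bisect_right(forbidden, v)
        if i + shift == v:
            return v
        v = i + shift
```

With forbidden = [1, 2, 4, 8] the allowed values are 0, 3, 5, 6, 7, 9, …, and nth_allowed recovers them by index.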

Now to compare that to the “hack” (and the default implementation in Python) that Veedrac proposed, which has space O(f+k) and time (where n/(n-(f+k)) is the expected number of “guesses” per draw):

O(f + k⋅(n/(n-(f+k))))

I just plotted this for k=10 and a reasonably big n=10000 (it only gets more extreme for bigger n). And I have to say: I only implemented this because it seemed like a fun challenge, but even I am surprised by how extreme this is:

[plot: runtime comparison of both methods for k=10, n=10000]

Let’s zoom in to see what’s going on:

[plot: zoomed-in view of the same runtime comparison]

Yes – the guesses are even faster for the 9998th number you generate. Note that, as you can see in the first plot, even my one-liner is probably faster for bigger f/n (but still has rather horrible space requirements for big n).

To drive the point home: The only thing you are spending time on here is generating the set, as that’s the f factor in Veedrac’s method.

[plot: runtime of Veedrac’s method, dominated by building the set (the f term)]

So I hope my time here was not wasted and I managed to convince you that Veedrac’s method is simply the way to go. I can kind of understand why that probabilistic part troubles you, but maybe think of the fact that hashmaps (= python dicts) and tons of other algorithms work with similar methods and they seem to be doing just fine.

You might be afraid of the variance in the number of repetitions. As noted above, the number of guesses follows a geometric distribution with p=(n-f)/n. So the standard deviation (= the amount you “should expect” the result to deviate from the expected average) is:

σ = √(1-p)/p = √(f⋅n)/(n-f)

Which is basically the same as the mean (√(f⋅n) < √(n²) = n).
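For completeness, that standard deviation falls straight out of the geometric distribution:

```latex
p = \frac{n-f}{n}, \qquad
\mathbb{E}[X] = \frac{1}{p} = \frac{n}{n-f}, \qquad
\sigma = \frac{\sqrt{1-p}}{p} = \frac{\sqrt{f \cdot n}}{n-f}
```

so σ/E[X] = √(f⋅n)/n = √(f/n) ≤ 1: the deviation never exceeds the mean itself.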

edit:
I just realized that s is actually also n/(n-(f+k)). So a more exact runtime for my algorithm is O(k⋅log²(n/(n-(f+k))) + f⋅log(f)). This is nice since, given the graphs above, it proves my intuition that this is quite a bit faster than O(k⋅log(f+k) + f⋅log(f)). But rest assured that this also does not change anything about the results above, as f⋅log(f) is by far the dominant part of the runtime.
I just realized that s is actually also n/(n-(f+k)). So a more exact runtime for my algorithm is O(k⋅log²(n/(n-(f+k))) + f⋅log(f)). Which is nice since given the graphs above, it proves my intuition that that is quite a bit faster than O(k⋅log(f+k) + f⋅log(f)). But rest assured that that also does not change anything about the results above, as the f⋅log(f) is the absolutely dominant part in the runtime.