How to use random.shuffle() on a generator? python


random.shuffle() shuffles its argument in place, so to shuffle uniformly it needs to know how long the input is and be able to reassign its items. A generator cannot provide this; you have to materialize it into a list:

    import random

    lst = list(yielding(x))  # read every item from the generator into a list
    random.shuffle(lst)      # shuffle the list in place
    for i in lst:
        print(i)
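To see why the list() call is needed, here is a minimal sketch. The yielding() function below is a hypothetical stand-in for the generator in the question, and shuffled() is just an illustrative helper name, not part of the random module:

```python
import random

def yielding(n):
    # Hypothetical stand-in for the generator from the question.
    for i in range(n):
        yield i

def shuffled(iterable):
    # Materialize first: random.shuffle() needs len() and item assignment.
    items = list(iterable)
    random.shuffle(items)
    return items

try:
    random.shuffle(yielding(5))
except TypeError as exc:
    # A generator has no len(), so shuffle() cannot work on it directly.
    print("shuffle on a generator fails:", exc)

print(shuffled(yielding(5)))
```

The helper returns a new list, so the caller does not have to keep the temporary variable around.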

You could, instead, use sorted() with random.random() as the key:

    for i in sorted(yielding(x), key=lambda k: random.random()):
        print(i)

but since this also produces a list, there is little point in going this route.

Demo:

    >>> import random
    >>> x = [1,2,3,4,5,6,7,8,9]
    >>> sorted(iter(x), key=lambda k: random.random())
    [9, 7, 3, 2, 5, 4, 6, 1, 8]


It's not possible to randomize the order in which a generator yields its items without temporarily saving all the elements somewhere. Luckily, this is pretty easy in Python:

    import random

    tmp = list(yielding(x))
    random.shuffle(tmp)
    for i in tmp:
        print(i)

Note the call to list() which will read all items and put them into a list.

If you don't want to or can't store all elements, you will need to change the generator to yield in a random order.


Depending on the case, if you know how much data you have ahead of time, you can index the data and compute/read from it based on a shuffled index. This amounts to: 'don't use a generator for this problem', and without specific use-cases it's hard to come up with a general method.
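A minimal sketch of the shuffled-index idea, assuming the data supports len() and integer indexing (a list here, but it could be anything addressable by position). The name iter_shuffled and the sample records are made up for illustration:

```python
import random

def iter_shuffled(data):
    # Assumes `data` supports len() and integer indexing.
    # Only the list of indices is shuffled; the items themselves are
    # looked up one at a time as they are yielded.
    indices = list(range(len(data)))
    random.shuffle(indices)
    for i in indices:
        yield data[i]

records = ["a", "b", "c", "d", "e"]
for item in iter_shuffled(records):
    print(item)
```

This gives a uniform shuffle, and the extra memory is one integer per item rather than a copy of the items.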

Alternatively, if you have to use the generator, it depends on how shuffled you need the data to be. As others have pointed out, generators don't have a length, so at some point you need to evaluate the generator, which could be expensive. If you don't need perfect randomness, you can introduce a shuffle buffer:

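    from itertools import islice
    import numpy as np

    def shuffle(generator, buffer_size):
        # iter() is a no-op on a generator, but keeps islice() from restarting
        # at the beginning if a list or other re-iterable is passed in.
        iterator = iter(generator)
        while True:
            # Pull up to buffer_size items; an empty list means the input is exhausted.
            buffer = list(islice(iterator, buffer_size))
            if len(buffer) == 0:
                break
            np.random.shuffle(buffer)  # shuffle this chunk in place
            for item in buffer:
                yield item

    shuffled_generator = shuffle(my_generator, 256)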

This shuffles the data in chunks of buffer_size, which avoids memory issues if memory is your limiting factor. It is not a truly random shuffle, since an item never leaves its chunk, so it shouldn't be used on input that's sorted; but if you just need to add some randomness to your data, this may be a good solution.
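To make the "not truly random" caveat concrete, here is a small sketch of the same chunked approach using only the standard library (random.shuffle instead of NumPy; the name chunk_shuffle is an assumption for this example). It shows that every item stays inside its original chunk:

```python
import random
from itertools import islice

def chunk_shuffle(iterable, buffer_size):
    # Standard-library variant of the chunked shuffle: read up to
    # buffer_size items, shuffle them, yield them, repeat.
    iterator = iter(iterable)
    while True:
        buffer = list(islice(iterator, buffer_size))
        if not buffer:
            break
        random.shuffle(buffer)
        yield from buffer

out = list(chunk_shuffle(range(12), 4))
print(out)
# Each group of 4 outputs holds exactly the items of the matching input chunk.
print([sorted(out[i:i + 4]) for i in range(0, 12, 4)])
# → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```

With sorted input, the output is still roughly sorted at the scale of buffer_size, which is why a larger buffer gives better mixing at the cost of more memory.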