How to use random.shuffle() on a generator? (Python)
In order to shuffle the sequence uniformly, random.shuffle() needs to know how long the input is. A generator cannot provide this; you have to materialize it into a list:
    lst = list(yielding(x))
    random.shuffle(lst)
    for i in lst:
        print(i)
You could, instead, use sorted() with random.random() as the key:
    for i in sorted(yielding(x), key=lambda k: random.random()):
        print(i)
but since this also produces a list, there is little point in going this route.
Demo:
    >>> import random
    >>> x = [1,2,3,4,5,6,7,8,9]
    >>> sorted(iter(x), key=lambda k: random.random())
    [9, 7, 3, 2, 5, 4, 6, 1, 8]
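To see why the list is needed, here is a minimal sketch. It assumes a hypothetical stand-in generator named yielding() and a helper named shuffled(), neither of which comes from the question. Passing the generator straight to random.shuffle() fails because shuffle() calls len() on its argument; materializing it first works.

```python
import random

def yielding(n):
    # Hypothetical stand-in for the generator in the question.
    for i in range(n):
        yield i

def shuffled(iterable):
    # Materialize the items first; random.shuffle() works in place on a list.
    items = list(iterable)
    random.shuffle(items)
    return items

try:
    random.shuffle(yielding(5))
    failed = False
except TypeError:
    # A generator has no len(), which shuffle() needs.
    failed = True

result = shuffled(yielding(5))
```

The result contains the same five items as the generator produced, in a random order.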
It's not possible to randomize the order of the items a generator yields without temporarily saving all the elements somewhere. Luckily, this is pretty easy in Python:
    tmp = list(yielding(x))
    random.shuffle(tmp)
    for i in tmp:
        print(i)
Note the call to list(), which will read all items and put them into a list.
If you don't want to or can't store all elements, you will need to change the generator to yield in a random order.
Depending on the case, if you know how much data you have ahead of time, you can index the data and compute/read from it based on a shuffled index. This amounts to: 'don't use a generator for this problem', and without specific use-cases it's hard to come up with a general method.
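The shuffled-index idea could look roughly like this. It is only a sketch under the assumption that the data supports len() and integer indexing (a list, a tuple, an array, or similar); the name iter_shuffled() is made up for illustration.

```python
import random

def iter_shuffled(data):
    # `data` is assumed to support len() and integer indexing.
    indices = list(range(len(data)))
    # Shuffle the indices, not the data itself.
    random.shuffle(indices)
    for i in indices:
        yield data[i]

records = ["a", "b", "c", "d", "e"]
out = list(iter_shuffled(records))
```

Only the list of indices is held in memory and shuffled; the items themselves are read one at a time, which helps when each item is large or is loaded lazily by the indexing operation.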
Alternatively, if you need to use the generator, it depends on how shuffled you want the data. As others have pointed out, generators don't have a length, so at some point you need to evaluate the whole generator, which could be expensive. If you don't need perfect randomness, you can introduce a shuffle buffer:
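    from itertools import islice

    import numpy as np

    def shuffle(generator, buffer_size):
        # iter() makes sure islice() consumes the input instead of restarting it
        iterator = iter(generator)
        while True:
            # Pull up to buffer_size items, shuffle them, then yield them
            buffer = list(islice(iterator, buffer_size))
            if len(buffer) == 0:
                break
            np.random.shuffle(buffer)
            for item in buffer:
                yield item

    shuffled_generator = shuffle(my_generator, 256)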
This will shuffle data in chunks of buffer_size, so you can avoid memory issues if that is your limiting factor. Of course, this is not a truly random shuffle: an item never leaves the chunk it arrived in, so it shouldn't be used on something that's sorted. But if you just need to add some randomness to your data, this may be a good solution.
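The limitation with sorted input is easy to see in a small experiment. The sketch below is a stdlib-only variant of the buffered shuffle (random.shuffle instead of numpy, and the name chunk_shuffle() is made up here), run on sorted input with a buffer size of 4.

```python
import random
from itertools import islice

def chunk_shuffle(iterable, buffer_size):
    # Stdlib-only variant of the buffered shuffle (random instead of numpy).
    iterator = iter(iterable)
    while True:
        buffer = list(islice(iterator, buffer_size))
        if not buffer:
            break
        random.shuffle(buffer)
        yield from buffer

out = list(chunk_shuffle(range(12), 4))

# Sort each block of four output items to see which inputs it contains.
chunks = [sorted(out[i:i + 4]) for i in range(0, 12, 4)]
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```

Every block of four outputs still holds exactly the four inputs that arrived together, so the overall order stays roughly sorted. A larger buffer_size mixes items over a wider range at the cost of more memory.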