Memory-efficient way to generate a large numpy array containing random boolean values


One problem with using np.random.randint is that, by default, it generates 64-bit integers (on most platforms), whereas numpy's boolean dtype (np.bool_) uses only 8 bits to represent each boolean value. You are therefore allocating an intermediate array 8x larger than necessary.
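The size difference is easy to check directly. The sketch below is only an illustration: the array length `n` is an arbitrary stand-in, and the integer width depends on the platform and NumPy version, so the 8x figure assumes a 64-bit default integer.

```python
import numpy as np

n = 1_000_000  # arbitrary stand-in for the real array size

# Default integer output: one platform-sized integer per element
# (8 bytes where the default integer is 64-bit, 4 bytes otherwise).
ints = np.random.randint(0, 2, size=n)

# Converting afterwards gives 1 byte per element, but the larger
# integer array had to be allocated first.
bools = ints.astype(np.bool_)

print(ints.dtype, ints.nbytes, "bytes")
print(bools.dtype, bools.nbytes, "bytes")
```

The peak memory use is dominated by the intermediate integer array, which is what the approaches in this answer try to avoid.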

A workaround that avoids intermediate 64-bit dtypes is to generate a string of random bytes using np.random.bytes, which can be converted to an array of 8-bit integers using np.frombuffer (the replacement for the deprecated binary mode of np.fromstring). These integers can then be converted to boolean values, for example by testing whether they are less than 256 * p, where p is the desired probability of each element being True:

```python
import numpy as np

def random_bool(shape, p=0.5):
    n = int(np.prod(shape))
    # Interpret n random bytes as uint8 values in the range 0..255
    x = np.frombuffer(np.random.bytes(n), dtype=np.uint8, count=n)
    # A uniform byte is below 256 * p with probability ~p
    return (x < 256 * p).reshape(shape)
```

Benchmark:

```
In [1]: shape = 1200, int(2E6)

In [2]: %timeit random_bool(shape)
1 loops, best of 3: 12.7 s per loop
```

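One important caveat is that a single random byte can only take 256 distinct values, so the probability actually achieved is always a multiple of 1/256 and matches the requested p only to within 1/256 (for an exact multiple of 1/256 such as p=1/2 this should not affect accuracy).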
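The quantization can be checked exactly by enumerating all 256 possible byte values instead of sampling. This is a minimal sketch, assuming the threshold has the form `scale * p`; the helper `effective_probability` and its `scale` parameter are hypothetical names used only for illustration.

```python
import numpy as np

def effective_probability(p, scale=256):
    """Exact probability that a uniform byte (0..255) is below scale * p."""
    values = np.arange(256)
    return float((values < scale * p).mean())

print(effective_probability(0.5))             # 0.5 (exact multiple of 1/256)
print(effective_probability(0.3))             # 77/256 = 0.30078125
print(effective_probability(1.0, scale=255))  # 255/256 = 0.99609375
print(effective_probability(1.0, scale=256))  # 1.0
```

The last two lines show why the choice of scale matters at the extremes: a threshold of 255 * p can never produce an all-True array, whereas 256 * p does for p=1.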


Update:

An even faster method is to exploit the fact that you only need to generate a single random bit per 0 or 1 in your output array. You can therefore create a random array of 8-bit integers 1/8th the size of the final output, then expand it with np.unpackbits and view the result as np.bool_. Note that this fixes the probability of True at 0.5:

```python
def fast_random_bool(shape):
    n = int(np.prod(shape))
    nb = -(-n // 8)     # ceiling division
    b = np.frombuffer(np.random.bytes(nb), dtype=np.uint8, count=nb)
    # Each byte expands to eight 0/1 values; drop the surplus bits
    return np.unpackbits(b)[:n].reshape(shape).view(np.bool_)
```

For example:

```
In [3]: %timeit fast_random_bool(shape)
1 loops, best of 3: 5.54 s per loop
```
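To see why the bit-unpacking step works, it may help to run it on a tiny input. The sketch below is illustrative only: the byte value and the output length `n = 19` are arbitrary choices, picked so that the last byte is only partly used.

```python
import numpy as np

# One byte expands into eight 0/1 values, most significant bit first.
bits = np.unpackbits(np.array([0b10110000], dtype=np.uint8))
# bits is [1 0 1 1 0 0 0 0]

n = 19                 # desired number of booleans (not a multiple of 8)
nb = -(-n // 8)        # ceiling division: 3 bytes are needed for 19 bits
packed = np.frombuffer(np.random.bytes(nb), dtype=np.uint8)

# Unpack to 24 bits, keep the first 19, and reinterpret the 0/1 bytes
# as booleans without copying.
flags = np.unpackbits(packed)[:n].view(np.bool_)
print(flags.shape, flags.dtype)
```

Because the unpacked array already holds one byte per element containing only 0 or 1, the final view as np.bool_ costs no extra memory.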