Create large random boolean matrix with numpy


The problem is your RAM: the values are stored in memory as the matrix is being created. I just created this matrix with this command:

np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])

I used an AWS i3 instance with 64GB of RAM and 8 cores. According to htop, creating this matrix takes up ~20GB of RAM. Here is a benchmark in case you care:

time np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])
CPU times: user 18.3 s, sys: 3.4 s, total: 21.7 s
Wall time: 21.7 s

def mask_method(N, p):
    for i in range(N):
        mask[i] = np.random.choice(a=[False, True], size=N, p=[p, 1-p])
        if (i % 100 == 0):
            print(i)

time mask_method(N, p)
CPU times: user 20.9 s, sys: 1.55 s, total: 22.5 s
Wall time: 22.5 s

Note that the mask method only takes up ~9GB of RAM at its peak.

Edit: The first method frees the RAM after the process is done, whereas the function method retains all of it.
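
For reference, a self-contained sketch of that row-by-row approach (the timing snippet above assumes mask already exists; pre-allocating it as a boolean array, and the small test values at the end, are my additions):

import numpy as np

def mask_method(N, p):
    # pre-allocate the boolean result: 1 byte per cell, N*N bytes in total
    mask = np.empty((N, N), dtype=bool)
    for i in range(N):
        # only one row is drawn at a time
        mask[i] = np.random.choice(a=[False, True], size=N, p=[p, 1-p])
    return mask

mask = mask_method(1000, 0.1)   # small N and an example p for a quick check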



The way that np.random.choice works is by first generating a float64 in [0, 1) for every cell of your data, and then converting that into an index into your array using np.searchsorted. This intermediate representation is 8 times larger than the boolean array!
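
To see the size difference this describes, here is a quick illustration (the small N and example p are mine, chosen only so it runs quickly):

import numpy as np

N, p = 1000, 0.1
bool_result = np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])
float_draws = np.random.random_sample((N, N))   # same shape/dtype as the intermediate draws

print(bool_result.nbytes)   # 1000000 bytes: 1 byte per cell
print(float_draws.nbytes)   # 8000000 bytes: 8 bytes per cell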

Since your data is boolean, you can get a factor of two speedup with

np.random.rand(N, N) > p

which, naturally, you could use inside your looping solution, as sketched below.
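
Here is a sketch of the row-by-row loop using rand (the function name and the small test values are mine):

import numpy as np

def mask_method_rand(N, p):
    mask = np.empty((N, N), dtype=bool)      # 1 byte per cell
    for i in range(N):
        # only one row of float64s (8*N bytes) exists at any one time
        mask[i] = np.random.rand(N) > p
    return mask

mask = mask_method_rand(1000, 0.1)   # small N for a quick check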

It seems like np.random.choice could do with some buffering here - you might want to file an issue against numpy.

Another option would be to try and generate float32s instead of float64s. I'm not sure if numpy can do that right now, but you could request the feature.
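
As an aside, newer numpy versions can already do this through the Generator API (numpy 1.17+, if I remember correctly), which halves the size of the float intermediate; a minimal sketch with example values:

import numpy as np

rng = np.random.default_rng()
# draw float32s instead of float64s, then compare against p to get booleans
mask = rng.random((1000, 1000), dtype=np.float32) > 0.1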


Another possibility would be to generate the matrix in batches (i.e. compute many sub-arrays and stack them together at the very end). But avoid updating a single array (mask) in a for loop as the OP is doing: that keeps the whole array resident in main memory for every indexing update.

Instead, for example: to get a 30000x30000 matrix, build 90000 separate 100x100 arrays (300 blocks along each side), fill each 100x100 block in a for loop, and finally stack those blocks together into one giant array, as sketched below. This should need no more than about 4GB of RAM and would be very fast as well.

Minimal Example:

In [9]: a
Out[9]:
array([[0, 1],
       [2, 3]])

In [10]: np.hstack([np.vstack([a]*5)]*5)
Out[10]:
array([[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3]])

In [11]: np.hstack([np.vstack([a]*5)]*5).shape
Out[11]: (10, 10)
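
Putting the pieces together, a sketch of the blocked approach (the function name, block-size argument, and small test values are mine; it assumes N is divisible by block):

import numpy as np

def blocked_mask(N, p, block=100):
    n = N // block
    block_rows = []
    for _ in range(n):
        # generate one row of n blocks (block x block each) and join them horizontally
        row = [np.random.choice(a=[False, True], size=(block, block), p=[p, 1-p])
               for _ in range(n)]
        block_rows.append(np.hstack(row))
    # stack the n block-rows vertically into the full N x N boolean array
    return np.vstack(block_rows)

mask = blocked_mask(1000, 0.1)   # small N and an example p for a quick check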