Create large random boolean matrix with numpy

Question:

I am trying to create a huge boolean matrix which is randomly filled with True and False with a given probability p. At first I used this code:

N = 30000
p = 0.1
np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])  

But sadly it does not seem to terminate for this large N. So I tried to split it up into the generation of the single rows by doing this:

N = 30000
p = 0.1
mask = np.empty((N, N))
for i in range(N):
    mask[i] = np.random.choice(a=[False, True], size=N, p=[p, 1-p])
    if (i % 100 == 0):
        print(i)

Now something strange happens (at least on my device): the first ~1100 rows are generated very quickly, but after that the code becomes horribly slow. Why is this happening? What am I missing here? Are there better ways to create a big matrix that has True entries with probability p and False entries with probability 1-p?

Edit: Many of you assumed that RAM would be the problem; since the device that will run the code has almost 500GB of RAM, this won't be an issue.

Asked By: zimmerrol


Answers:

The problem is your RAM: the values are stored in memory as the matrix is being created. I just created this matrix using this command:

np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])

I used an AWS i3 instance with 64GB of RAM and 8 cores. While creating this matrix, htop shows that it takes up ~20GB of RAM. Here is a benchmark in case you care:

%time np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])

CPU times: user 18.3 s, sys: 3.4 s, total: 21.7 s
Wall time: 21.7 s


def mask_method(N, p):
    mask = np.empty((N, N))  # same allocation as in the question
    for i in range(N):
        mask[i] = np.random.choice(a=[False, True], size=N, p=[p, 1-p])
        if (i % 100 == 0):
            print(i)
    return mask

%time mask_method(N, p)

CPU times: user 20.9 s, sys: 1.55 s, total: 22.5 s
Wall time: 22.5 s

Note that the mask method only takes up ~9GB of RAM at its peak.

Edit: The first method flushes the RAM after the process is done, whereas the function method retains all of it.

Answered By: gold_cy

Another possibility could be to generate it in batches (i.e. compute many sub-arrays and stack them together at the very end). But consider not updating a single array (mask) in a for loop as the OP is doing: that forces the whole array to be loaded into main memory during every indexing update.

Instead, for example: to get 30000x30000, keep 90,000 separate 100x100 arrays, update each of these 100x100 arrays accordingly in a for loop, and only at the end stack these 90,000 arrays together into a giant array. This would definitely need no more than 4GB of RAM and would be very fast as well.
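A rough sketch of this batched idea (my own illustration, using row blocks of 1000 rows instead of 100x100 tiles for brevity; the block size is an arbitrary choice):

import numpy as np

N = 30000
p = 0.1
block = 1000  # rows per sub-array (must divide N in this sketch)

# Generate each block independently, then stack them once at the end.
blocks = [np.random.choice(a=[False, True], size=(block, N), p=[p, 1-p])
          for _ in range(N // block)]
mask = np.vstack(blocks)  # final (N, N) boolean array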

Minimal Example:

In [9]: a
Out[9]: 
array([[0, 1],
       [2, 3]])

In [10]: np.hstack([np.vstack([a]*5)]*5)
Out[10]: 
array([[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3]])

In [11]: np.hstack([np.vstack([a]*5)]*5).shape
Out[11]: (10, 10)
Answered By: kmario23

Regarding "So I tried to split it up into the generation of the single rows by doing this":

The way that np.random.choice works is by first generating a float64 in [0, 1) for every cell of your data, and then converting that into an index into your array using np.searchsorted. This intermediate representation is 8 times larger than the boolean array!
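As a rough illustration of that mechanism (my own sketch, not NumPy's actual source), the weighted draw boils down to sampling uniform float64s and mapping them to indices with a searchsorted against the cumulative probabilities:

import numpy as np

a = np.array([False, True])
p_cum = np.cumsum([0.1, 0.9])        # cumulative probabilities of the choices
u = np.random.random_sample((4, 4))  # float64 intermediate: 8 bytes per cell
idx = np.searchsorted(p_cum, u)      # map each float to an index into `a`
sample = a[idx]                      # the boolean result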

Since your data is boolean, you can get a factor of two speedup with

np.random.rand(N, N) > p

which, naturally, you could also use inside your looping solution.
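For example, a sketch of that row-by-row variant (my own adaptation of the question's loop, with the mask allocated as bool so each cell costs one byte instead of eight):

import numpy as np

N = 30000
p = 0.1
mask = np.empty((N, N), dtype=bool)  # 1 byte per cell instead of 8
for i in range(N):
    mask[i] = np.random.rand(N) > p  # True with probability 1 - p, as above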

It seems like np.random.choice could do with some buffering here – you might want to file an issue against numpy.

Another option would be to try and generate float32s instead of float64s. I’m not sure if numpy can do that right now, but you could request the feature.
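If I'm not mistaken, newer NumPy versions (1.17+) support this via the Generator API, which can produce float32 uniforms directly:

import numpy as np

N = 30000
p = 0.1
rng = np.random.default_rng()
# Generator.random accepts dtype=np.float32, halving the size of the
# intermediate float array compared to float64.
mask = rng.random((N, N), dtype=np.float32) > p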

Answered By: Eric

Really surprised no one has mentioned this solution yet.

This line

np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])

runs N×N Bernoulli trials (in your case, 900M of them!). A Bernoulli trial is just a random experiment with two possible outcomes, with probabilities p and 1-p.

The sum of N Bernoulli trials, each with probability p, can be modeled by the Binomial distribution.

We can leverage this fact to randomly simulate the total count of True elements. With NumPy,

import numpy as np

N = 30000
p = 0.1

# Build a random number generator
rng = np.random.default_rng(123)

# Randomly determine the total number of True values
Ntrue = rng.binomial(n=N*N, p=p, size=1)[0]  # 90016776

Now we can randomly determine the position of each True element by choosing flat indices without replacement and unraveling them into row and column indices.

# Randomly determine true position
position_ids = rng.choice(a=N*N, size=Ntrue, replace=False)
positions = np.unravel_index(position_ids, shape=(N,N))

And now we can populate a compressed sparse row (CSR) matrix.

from scipy import sparse

# Build a compressed sparse row matrix with the constructor:
# csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])
result = sparse.csr_matrix((np.ones(shape=Ntrue), positions), shape=(N,N))

Notice this solution avoids storing and computing 900M boolean values.
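Continuing from the snippet above, a quick sanity check (my own addition): the stored density should be close to p, and small slices can be inspected without densifying the whole matrix.

print(result.nnz / (N * N))      # fraction of stored (True) entries, close to p = 0.1
print(result[0, :10].toarray())  # inspect a small slice without building the full dense matrix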

Funny enough, I wrote about a nearly identical problem before stumbling upon this question.

Answered By: Ben