Create large random boolean matrix with numpy
Question:
I am trying to create a huge boolean matrix which is randomly filled with True and False with a given probability p. At first I used this code:
import numpy as np

N = 30000
p = 0.1
np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])
But sadly it does not seem to terminate for such a big N. So I tried to split it up into generating the single rows instead:
N = 30000
p = 0.1
mask = np.empty((N, N))
for i in range(N):
    mask[i] = np.random.choice(a=[False, True], size=N, p=[p, 1-p])
    if i % 100 == 0:
        print(i)
Now something strange happens (at least on my device): the first ~1100 rows are generated very fast, but after that the code becomes horribly slow. Why is this happening? What am I missing here? Are there better ways to create a big matrix which has True entries with probability p and False entries with probability 1-p?
Edit: As many of you assumed that RAM would be a problem: the device which will run the code has almost 500GB of RAM, so this won't be an issue.
Answers:
The problem is your RAM: the values are being stored in memory as they are being created. I just created this matrix using this command:
np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])
I used an AWS i3 instance with 64GB of RAM and 8 cores. To create this matrix, htop shows that it takes up ~20GB of RAM. Here is a benchmark in case you care:
%time np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])
CPU times: user 18.3 s, sys: 3.4 s, total: 21.7 s
Wall time: 21.7 s
def mask_method(N, p):
    mask = np.empty((N, N))
    for i in range(N):
        mask[i] = np.random.choice(a=[False, True], size=N, p=[p, 1-p])
        if i % 100 == 0:
            print(i)
%time mask_method(N, p)
CPU times: user 20.9 s, sys: 1.55 s, total: 22.5 s
Wall time: 22.5 s
Note that the mask method only takes up ~9GB of RAM at its peak.
Edit: The first method flushes the RAM after the process is done, whereas the function method retains all of it.
Another possibility could be to generate it in batches, i.e. compute many sub-arrays and stack them together at the very end. But avoid updating one array (mask) in a for loop as the OP is doing, since that forces the whole array to be loaded into main memory on every indexing update.
Instead, for example: to get 30000x30000, keep 90000 separate 100x100 arrays (300 blocks per side), update each of these 100x100 arrays accordingly in a for loop, and finally stack these 90000 arrays together into one giant array. This would definitely need no more than 4GB of RAM and would be very fast as well.
Minimal Example:
In [8]: a = np.arange(4).reshape(2, 2)

In [9]: a
Out[9]:
array([[0, 1],
[2, 3]])
In [10]: np.hstack([np.vstack([a]*5)]*5)
Out[10]:
array([[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
[2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
[2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
[2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
[2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
[2, 3, 2, 3, 2, 3, 2, 3, 2, 3]])
In [11]: np.hstack([np.vstack([a]*5)]*5).shape
Out[11]: (10, 10)
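Applying the same block-wise idea to the random mask itself might look like the following sketch. Note this uses the rand-then-threshold trick from another answer rather than np.random.choice per block, and N and the tile size are scaled down here for illustration:

```python
import numpy as np

N = 3000       # scaled down from the OP's 30000
block = 1000   # tile edge; assumes N is divisible by block
p = 0.1

rng = np.random.default_rng(0)

# Build each row of tiles, then stack the rows into the full matrix.
rows = []
for _ in range(N // block):
    tiles = [rng.random((block, block)) > p for _ in range(N // block)]
    rows.append(np.hstack(tiles))
mask = np.vstack(rows)

print(mask.shape, mask.dtype)   # (3000, 3000) bool
```

Because each tile is a boolean array, the intermediate float64 draws only ever exist one tile at a time, which is what keeps the peak memory low.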
"So I tried to split it up into the generation of the single rows by doing this:"
The way that np.random.choice works is by first generating a float64 in [0, 1) for every cell of your data, and then converting that into an index into your array using np.searchsorted. This intermediate representation is 8 times larger than the boolean array!
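To put rough numbers on that intermediate representation (a back-of-the-envelope sketch, assuming 1 byte per bool and 8 bytes per float64):

```python
N = 30000

# Final boolean mask: 1 byte per element
bool_gb = N * N * 1 / 1e9     # ~0.9 GB

# Intermediate float64 draws inside np.random.choice: 8 bytes per element
float_gb = N * N * 8 / 1e9    # ~7.2 GB
```

So even though the result fits in under a gigabyte, the call transiently needs several times that.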
Since your data is boolean, you can get a factor-of-two speedup with
np.random.rand(N, N) > p
which, naturally, you could use inside your looping solution.
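A quick check of the rand-then-threshold idea at a smaller N (scaled down here so it runs in a moment):

```python
import numpy as np

N = 2000
p = 0.1

# Comparing against a scalar yields a boolean array directly;
# True appears with probability 1 - p, matching the OP's p=[p, 1-p] ordering.
mask = np.random.rand(N, N) > p

print(mask.dtype)              # bool
print(round(mask.mean(), 2))   # close to 0.9
```

No float64-to-index conversion step is needed, and the only large temporary is the single float array that the comparison consumes.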
It seems like np.random.choice could do with some buffering here – you might want to file an issue against numpy.
Another option would be to try to generate float32s instead of float64s. I'm not sure if numpy can do that right now, but you could request the feature.
Really surprised no one has mentioned this solution yet. This line
np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])
runs N×N Bernoulli trials (in your case, 900M of them!). A Bernoulli trial is just a random experiment with two possible outcomes, with probabilities p and 1-p.
The sum of N Bernoulli trials, each with probability p, can be modeled by the Binomial distribution.
We can leverage this fact to randomly simulate the total count of True elements. With NumPy,
import numpy as np
N = 30000
p = 0.1
# Build a random number generator
rng = np.random.default_rng(123)
# Randomly determine the total number of True values
Ntrue = rng.binomial(n=N*N, p=p, size=1)[0] # 90016776
Now we can randomly determine the position of each True element by randomly choosing row and col indices without replacement.
# Randomly determine true position
position_ids = rng.choice(a=N*N, size=Ntrue, replace=False)
positions = np.unravel_index(position_ids, shape=(N,N))
And now we can populate a compressed sparse row (CSR) matrix.
from scipy import sparse
# Build a compressed sparse row matrix with the constructor:
# csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])
result = sparse.csr_matrix((np.ones(shape=Ntrue), positions), shape=(N,N))
Notice this solution avoids storing and computing 900M boolean values.
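As a quick sanity check, the same steps can be run at a smaller N, where the density of the resulting sparse matrix should land close to p:

```python
import numpy as np
from scipy import sparse

N = 1000
p = 0.1
rng = np.random.default_rng(123)

# Draw the total count of True entries, then their positions without replacement
Ntrue = rng.binomial(n=N * N, p=p)
position_ids = rng.choice(a=N * N, size=Ntrue, replace=False)
positions = np.unravel_index(position_ids, shape=(N, N))

result = sparse.csr_matrix((np.ones(Ntrue), positions), shape=(N, N))

# The stored density should be close to p
print(result.nnz / (N * N))
```

The memory cost scales with the number of True entries rather than with N², which is what makes this approach attractive when p is small.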
Funny enough, I wrote about a nearly identical problem before stumbling upon this question.