Vectorizing multivariate normal distribution calculation
Question:
I have n points in 3D space, each with a corresponding guess and a certainty attached to it. I want to evaluate the multivariate normal density at each point, using its guess as the mean and its certainty as the standard deviation. Currently, I’m using a loop and the scipy.stats multivariate_normal function, as shown in the code snippet below:
import numpy as np
from scipy.stats import multivariate_normal

n = 10

def create_points(n):
    return np.random.randint(0, 1000, size=(n, 3))

real = create_points(n)
guess = create_points(n)
uncertainties = np.random.randint(1, 100, size=n)

def iterative_scoring_scipy(real, guess, uncertainties):
    score = 0.0
    # One isotropic covariance matrix (sigma^2 * I) per point
    covariances = [
        np.diag([uncertainties[i]**2]*3)
        for i in range(len(real))
    ]
    # Loop over the points passed in, not over the global n
    for i in range(len(real)):
        score += multivariate_normal.pdf(real[i], mean=guess[i], cov=covariances[i])
    return score

print(iterative_scoring_scipy(real, guess, uncertainties))
Here is an attempt that does not use SciPy but instead uses NumPy:
def iterative_scoring_numpy(real, guess, uncertainties):
    score = 0.0
    # Loop over the points passed in, not over the global n
    for i in range(len(real)):
        # calculate the covariance matrix
        cov = np.diag([uncertainties[i]**2]*3)
        # calculate the determinant and inverse of the covariance matrix
        det = np.linalg.det(cov)
        inv = np.linalg.inv(cov)
        # calculate the difference between the real and guess points
        diff = real[i] - guess[i]
        # calculate the exponent
        exponent = -0.5 * np.dot(np.dot(diff, inv), diff)
        # calculate the constant factor
        const = 1 / np.sqrt((2 * np.pi)**3 * det)
        # calculate the probability density function value
        pdf = const * np.exp(exponent)
        # add the probability density function value to the score
        score += pdf
    return score

print(iterative_scoring_numpy(real, guess, uncertainties))
However, both approaches are slow, and I’m looking for a way to speed them up using vectorization. How can I vectorize either code snippet to make it faster?
Answers:
multivariate_normal.pdf does not seem to be vectorized over the mean and covariance: it accepts many points per call, but only a single mean and a single covariance matrix. Even if it were, its code is very generic and inefficient (it performs many repeated checks, calls functions that support N-D arrays, and handles corner cases that never occur in your code anyway). Your second attempt is a good try, but the NumPy functions operate on small vectors and matrices, whereas they could operate on the whole dataset at once. In short, the whole loop can be vectorized. I chose to read the SciPy code in order to write my own vectorized implementation of multivariate_normal.pdf supporting only your kind of input (i.e. 3D points). Here is the result:
# [...] same as in the question

# Vectorized equivalent to multivariate_normal.pdf
def mvn_pdf(x, mean, cov):
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    assert x.ndim == 2 and x.shape[1] == 3
    assert mean.ndim == 2 and mean.shape[1] == 3
    assert cov.ndim == 3 and cov.shape[1:] == (3, 3)
    # Eigendecomposition of all covariance matrices at once
    s, u = np.linalg.eigh(cov)
    eps = 2.22e-10 * np.max(np.abs(s), axis=1)
    # Each covariance matrix must be symmetric positive definite
    # (a negative eigenvalue would make np.log below return NaN)
    assert np.all(s > eps[:, None])
    # The rest of the code is unsafe if this is not true:
    # it assumes the eigenvectors are the coordinate axes,
    # which enables further optimizations
    assert np.allclose(u, np.identity(3))
    log_pdet = np.sum(np.log(s), axis=1)
    maha = np.sum((x - mean)**2 / s, axis=1)
    log_2pi = np.log(2 * np.pi)
    logpdf = -0.5 * (3 * log_2pi + log_pdet + maha)
    return np.exp(logpdf)

def fast_iterative_scoring_scipy(real, guess, uncertainties):
    # Stack of n isotropic covariance matrices, shape (n, 3, 3)
    covariances = (np.identity(3) * (uncertainties ** 2)[:,None,None])
    score = np.sum(mvn_pdf(np.array(real), np.array(guess), covariances))
    return score

print(fast_iterative_scoring_scipy(real, guess, uncertainties))
This is about 10 times faster with n=10 and more than 50 times faster with n=100 on my machine (i5-9600KF processor, NumPy 1.22.4 and SciPy 1.6.3). The bigger the input, the better the speed-up.
Note that you can safely remove the assertions if you are sure they always hold in your case. The code should be about twice as fast without them.
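If every covariance really is of the form sigma^2 * I, as in the question, the eigendecomposition can be skipped entirely, because the density then depends only on the squared distance between each point and its guess. The following is a minimal sketch under that assumption; the function name isotropic_mvn_score is hypothetical (it is not part of SciPy or NumPy), and I have not benchmarked it against the version above.

```python
import numpy as np

def isotropic_mvn_score(real, guess, uncertainties):
    """Sum over i of the density N(real[i]; guess[i], sigma_i^2 * I).

    Assumes each covariance is isotropic, i.e. sigma_i^2 times the identity,
    with uncertainties[i] playing the role of sigma_i (must be > 0).
    """
    real = np.asarray(real, dtype=float)
    guess = np.asarray(guess, dtype=float)
    sigma2 = np.asarray(uncertainties, dtype=float) ** 2
    d = real.shape[1]  # dimensionality of the points (3 in the question)
    # Squared Euclidean distance between each point and its guess
    sq_dist = np.sum((real - guess) ** 2, axis=1)
    # Log-density of an isotropic Gaussian, for all points at once:
    # -0.5 * (d * log(2*pi*sigma^2) + ||x - mu||^2 / sigma^2)
    log_pdf = -0.5 * (d * np.log(2 * np.pi * sigma2) + sq_dist / sigma2)
    return np.sum(np.exp(log_pdf))

# Usage with the same kind of random data as in the question
rng = np.random.default_rng(0)
real = rng.integers(0, 1000, size=(10, 3))
guess = rng.integers(0, 1000, size=(10, 3))
uncertainties = rng.integers(1, 100, size=10)
print(isotropic_mvn_score(real, guess, uncertainties))
```

Working in log space before the final np.exp mirrors what mvn_pdf does. For covariances that are not isotropic, this shortcut does not apply and the more general code above (or SciPy itself) is the safer choice.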