Vectorizing multivariate normal distribution calculation

Question:

I have n points in 3D space, each with a corresponding guess and an uncertainty attached to it. I want to evaluate the multivariate normal probability density at each point, given its guess (the mean) and its uncertainty (the standard deviation). Currently, I’m using an iterative approach with the scipy.stats multivariate_normal function, as shown in the code snippet below:

import numpy as np
from scipy.stats import multivariate_normal

n = 10

def create_points(n):
    return np.random.randint(0, 1000, size=(n, 3))


real = create_points(n)
guess = create_points(n)
uncertainties = np.random.randint(1, 100, size=n)

def iterative_scoring_scipy(real, guess, uncertainties):
    score = 0.0
    covariances = [
        np.diag([uncertainties[i]**2]*3)
        for i in range(len(real))
    ]
    for i in range(len(real)):
        score += multivariate_normal.pdf(real[i], mean=guess[i], cov=covariances[i])

    return score

print(iterative_scoring_scipy(real, guess, uncertainties))

Here is an attempt that does not use scipy but instead uses numpy:

def iterative_scoring_numpy(real, guess, uncertainties):
    score = 0.0
    for i in range(len(real)):
        # calculate the covariance matrix
        cov = np.diag([uncertainties[i]**2]*3)

        # calculate the determinant and inverse of the covariance matrix
        det = np.linalg.det(cov)
        inv = np.linalg.inv(cov)

        # calculate the difference between the real and guess points
        diff = real[i] - guess[i]

        # calculate the exponent
        exponent = -0.5 * np.dot(np.dot(diff, inv), diff)

        # calculate the constant factor
        const = 1 / np.sqrt((2 * np.pi)**3 * det)

        # calculate the probability density function value
        pdf = const * np.exp(exponent)

        # add the probability density function value to the score
        score += pdf

    return score

print(iterative_scoring_numpy(real, guess, uncertainties))

However, both approaches are slow, and I’m looking for a way to speed them up using vectorization. How can I vectorize either code snippet to make it faster?

Asked By: NicolaiF

Answers:

multivariate_normal.pdf does not seem to be vectorized over the mean and covariance parameters. Even if it were, its code is very generic and inefficient: it makes many repeated checks, calls functions that support N-D arrays, and handles corner cases that never occur in your code. Your second attempt is a good try, but its NumPy functions operate on small vectors and matrices when they could operate on the whole dataset at once. In short, the whole loop can be vectorized. I read the SciPy code in order to write my own vectorized implementation of multivariate_normal.pdf that supports only your kind of input (i.e. 3D points with the diagonal covariance matrices built in the question). Here is the result:

# [...] same as in the question

# Vectorized equivalent to multivariate_normal.pdf
def mvn_pdf(x, mean, cov):
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)

    assert x.ndim == 2 and x.shape[1] == 3
    assert mean.ndim == 2 and mean.shape[1] == 3
    assert cov.ndim == 3 and cov.shape[1:] == (3, 3)

    s, u = np.linalg.eigh(cov)
    eps = 2.22e-10 * np.max(np.abs(s), axis=1)

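    # Each covariance matrix must be symmetric positive definite,
    # so every eigenvalue has to be strictly positive (not just non-zero)
    assert np.all(s > eps[:, None])

    # The rest of the code is unsafe if the eigenvectors are not the identity
    # (they are for the isotropic diagonal covariances used here).
    # This enables further optimizations: the Mahalanobis distance
    # can be computed element-wise.
    assert np.allclose(u, np.identity(3))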

    log_pdet = np.sum(np.log(s), axis=1)

    maha = np.sum((x - mean)**2 / s, axis=1)
    log_2pi = np.log(2 * np.pi)
    logpdf = -0.5 * (3 * log_2pi + log_pdet + maha)

    return np.exp(logpdf)

def fast_iterative_scoring_scipy(real, guess, uncertainties):
    covariances = (np.identity(3) * (uncertainties ** 2)[:,None,None])
    score = np.sum(mvn_pdf(np.array(real), np.array(guess), covariances))
    return score

print(fast_iterative_scoring_scipy(real, guess, uncertainties))

This is about 10 times faster with n=10 and more than 50 times faster with n=100 on my machine (i5-9600KF processor, NumPy 1.22.4 and SciPy 1.6.3). The bigger the input, the better the speed-up.

Note that you can safely remove the assertions if you are sure they always hold in your case. The function should be about twice as fast without them.
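
Since every covariance matrix in the question is sigma^2 times the identity, the eigendecomposition could arguably be skipped altogether. The following is only a minimal sketch under that assumption and was not part of the timings above. The names isotropic_mvn_pdf and isotropic_scoring are hypothetical, and the code assumes strictly positive uncertainties and inputs of shape (n, d). It uses log det(sigma^2 I) = 2 d log(sigma) and a squared Mahalanobis distance of ||x - mean||^2 / sigma^2.

```python
import numpy as np


def isotropic_mvn_pdf(x, mean, sigma):
    """Density of N(mean[i], sigma[i]**2 * I) evaluated at x[i], for every row i.

    Assumption: x and mean have shape (n, d), sigma has shape (n,)
    and contains strictly positive standard deviations.
    """
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d = x.shape[1]

    # Squared Mahalanobis distance: ||x - mean||^2 / sigma^2
    maha = np.sum((x - mean) ** 2, axis=1) / sigma**2

    # log det(sigma^2 * I) = 2 * d * log(sigma)
    log_det = 2.0 * d * np.log(sigma)

    log_pdf = -0.5 * (d * np.log(2.0 * np.pi) + log_det + maha)
    return np.exp(log_pdf)


def isotropic_scoring(real, guess, uncertainties):
    # Sum of the per-point densities, like the loops in the question
    return np.sum(isotropic_mvn_pdf(real, guess, uncertainties))


# Example data shaped like the data in the question
rng = np.random.default_rng(0)
real = rng.integers(0, 1000, size=(10, 3))
guess = rng.integers(0, 1000, size=(10, 3))
uncertainties = rng.integers(1, 100, size=10)

print(isotropic_scoring(real, guess, uncertainties))
```

If the covariances were diagonal but not isotropic, or had off-diagonal terms, this shortcut would no longer apply and something closer to mvn_pdf above would be needed.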

Answered By: Jérôme Richard