Vectorizing code to calculate (squared) Mahalanobis Distance

Question:

EDIT 2: this post seems to have been moved from CrossValidated to StackOverflow because it is mostly about programming, but that means my fancy MathJax doesn’t work anymore. Hopefully this is still readable.

Say I want to calculate the squared Mahalanobis distance between two vectors x and y with covariance matrix S. This is a fairly simple function defined by

M2(x, y; S) = (x - y)^T * S^-1 * (x - y)

With python’s numpy package I can do this as

# x, y = numpy.ndarray of shape (n,)
# s_inv = numpy.ndarray of shape (n, n)
diff = x - y
d2 = diff.T.dot(s_inv).dot(diff)

or in R as

diff <- x - y
d2 <- t(diff) %*% s_inv %*% diff

In my case, though, I am given

  • m by n matrix X
  • n-dimensional vector mu
  • n by n covariance matrix S

and want to find the m-dimensional vector d such that

d_i = M2(x_i, mu; S)  ( i = 1 .. m )

where x_i is the ith row of X.

This is not difficult to accomplish using a simple loop in python:

d = numpy.zeros((m,))
for i in range(m):
    diff = x[i,:] - mu
    d[i] = diff.T.dot(s_inv).dot(diff)

Of course, since the outer loop runs in Python rather than in native code inside the numpy library, this isn’t as fast as it could be. n and m are about 3-4 and several hundred thousand respectively, and I’m doing this somewhat often in an interactive program, so a speedup would be very useful.

Mathematically, the only way I’ve been able to formulate this using basic matrix operations is

d = diag( X' * S^-1 * X'^T )

where

 x'_i = x_i - mu

which is simple to write a vectorized version of, but this is unfortunately outweighed by the inefficiency of calculating a 10-billion-plus element matrix and only taking the diagonal. I believe this operation should be easily expressible using Einstein notation, and thus could hopefully be evaluated quickly with numpy's einsum function, but I haven’t even begun to figure out how that black magic works.
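
For concreteness, the straightforward-but-wasteful vectorization described above would look roughly like this (a sketch using the same names as before; it materializes the full m x m product just to discard everything but its diagonal):

Xc = X - mu                              # rows of X centered on mu; shape (m, n)
d = numpy.diag(Xc.dot(s_inv).dot(Xc.T))  # builds an (m, m) matrix only to take its diagonal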

So, I would like to know: is there either a nicer way to formulate this operation mathematically (in terms of simple matrix operations), or could someone suggest some nice vectorized (python or R) code that does this efficiently?

BONUS QUESTION, for the brave

I don’t actually want to do this once; I want to do it k ~ 100 times. Given:

  • m by n matrix X

  • k by n matrix U

  • Set of n by n covariance matrices each denoted S_j (j = 1..k)

Find the m by k matrix D such that

D_{ij} = M2(x_i, u_j; S_j)

Where i = 1..m, j = 1..k, x_i is the ith row of X and u_j is the jth row of U.

I.e., vectorize the following code:

# s_inv is (k x n x n) array containing "stacked" inverses
# of covariance matrices
d = numpy.zeros( (m, k) )
for j in range(k):
    for i in range(m):
        diff = x[i, :] - u[j, :]
        d[i, j] = diff.T.dot(s_inv[j, :, :]).dot(diff)
Asked By: JaredL


Answers:

First off, it seems like maybe you’re getting S and then inverting it. You shouldn’t do that; it’s slow and numerically inaccurate. Instead, you should get the Cholesky factor L of S so that S = L L^T; then

M^2(x, y; L L^T)
  = (x - y)^T (L L^T)^-1 (x - y)
  = (x - y)^T L^-T L^-1 (x - y)
  = || L^-1 (x - y) ||^2,

and since L is triangular, L^-1 (x - y) can be computed efficiently.

As it turns out, scipy.linalg.solve_triangular will happily do a bunch of these at once if you reshape it properly:

import numpy as np
import scipy.linalg

L = np.linalg.cholesky(S)   # lower-triangular factor with S = L L^T
y = scipy.linalg.solve_triangular(L, (X - mu[np.newaxis]).T, lower=True)
d = np.einsum('ij,ij->j', y, y)

Breaking that down a bit, y[i, j] is the ith component of L^-1 (X_j - mu). The einsum call then does

d_j = sum_i y_{ij} y_{ij}
    = sum_i y_{ij}^2
    = || y_j ||^2,

like we need.

Unfortunately, solve_triangular won’t vectorize across its first argument, so you should probably just loop there. If k is only about 100, that’s not going to be a significant issue.
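
A minimal sketch of that loop for the bonus case might look like this (S_stack is my name for a (k, n, n) array of the covariance matrices, with X, U, m and k as in the question):

# Loop over the k covariance matrices, vectorizing over the m rows of X
# inside each iteration via solve_triangular.
D = np.empty((m, k))
for j in range(k):
    L_j = np.linalg.cholesky(S_stack[j])
    y = scipy.linalg.solve_triangular(L_j, (X - U[j]).T, lower=True)
    D[:, j] = np.einsum('ij,ij->j', y, y)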


If you are actually given S^-1 rather than S, then you can indeed do this with einsum more directly. Since S is quite small in your case, it’s also possible that actually inverting the matrix and then doing this would be faster. As soon as n is a nontrivial size, though, you’re throwing away a lot of numerical accuracy by doing this.
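
For the single-covariance case from the main question, that might look like (a sketch, with T standing for the given S^-1):

Xc = X - mu                               # centered rows, shape (m, n)
d = np.einsum('ik,kl,il->i', Xc, T, Xc)   # d_i = (x_i - mu)^T S^-1 (x_i - mu)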

To figure out what to do with einsum, write everything in terms of components. I’ll go straight to the bonus case, writing S_j^-1 = T_j for notational convenience:

D_{ij} = M^2(x_i, u_j; S_j)
  = (x_i - u_j)^T T_j (x_i - u_j)
  = sum_k (x_i - u_j)_k ( T_j (x_i - u_j) )_k
  = sum_k (x_i - u_j)_k sum_l (T_j)_{k l} (x_i - u_j)_l
  = sum_{k l} (X_{i k} - U_{j k}) (T_j)_{k l} (X_{i l} - U_{j l})

So, if we make arrays X of shape (m, n), U of shape (k, n), and T of shape (k, n, n), then we can write this as

diff = X[np.newaxis, :, :] - U[:, np.newaxis, :]
D = np.einsum('jik,jkl,jil->ij', diff, T, diff)

where diff[j, i, k] = X[i, k] - U[j, k].

Answered By: Danica

Dougal nailed this one with an excellent and detailed answer, but I thought I’d share a small modification that I found increases efficiency, in case anyone else is trying to implement this. Straight to the point:

Dougal’s method was as follows:

import numpy as np
import scipy.linalg

def mahalanobis2(X, mu, sigma):
    L = np.linalg.cholesky(sigma)
    y = scipy.linalg.solve_triangular(L, (X - mu[np.newaxis, :]).T, lower=True)
    return np.einsum('ij,ij->j', y, y)

A mathematically equivalent variant I tried is

def mahalanobis2_2(X, mu, sigma):
    # Lower-triangular Cholesky factor of the *inverse* of the covariance
    # matrix, so that linv.dot(linv.T) equals inv(sigma)
    linv = np.linalg.cholesky(np.linalg.inv(sigma))

    # Just do regular matrix multiplication with this matrix
    y = (X - mu[np.newaxis, :]).dot(linv)

    # Same as above, but note the different index at the end because
    # the matrix y is transposed here compared to above
    return np.einsum('ij,ij->i', y, y)

I ran both versions head-to-head 20 times using identical random inputs and recorded the times (in milliseconds). For X as a 1,000,000 x 3 matrix (mu of length 3 and sigma 3×3) I get:

Method 1 (min/max/avg): 30/62/49
Method 2 (min/max/avg): 30/47/37

That’s about a 30% speedup for the 2nd version. I’m mostly going to be running this in 3 or 4 dimensions but to see how it scaled I tried X as 1,000,000 x 100 and got:

Method 1 (min/max/avg): 970/1134/1043
Method 2 (min/max/avg): 776/907/837

which is about the same improvement.


I mentioned this in a comment on Dougal’s answer, but I’m adding it here for additional visibility:

The first pair of methods above take a single center point mu and covariance matrix sigma and calculate the squared Mahalanobis distance to each row of X. My bonus question was to do this multiple times with many sets of mu and sigma and output a two-dimensional matrix. The set of methods above can be used to accomplish this with a simple for loop, but Dougal also posted a more clever example using einsum.
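
Using the mahalanobis2 function above, that loop could look roughly like this (U and S here are my names for the stacked centers and covariance matrices):

# D[i, j] = squared Mahalanobis distance from row i of X to the jth
# center/covariance pair; U is (k, d) and S is (k, d, d).
D = np.column_stack([mahalanobis2(X, U[j], S[j]) for j in range(k)])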

I decided to compare these methods with each other by using them to solve the following problem: Given k d-dimensional normal distributions (with centers stored in rows of k by d matrix U and covariance matrices in the last two dimensions of the k by d by d array S), find the density at the n points stored in rows of the n by d matrix X.
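
A density evaluator built on any of the squared-distance functions above might look roughly like this (my own illustrative wrapper, not the exact benchmarked code):

def mvn_pdf_from_m2(X, mu, sigma):
    # Multivariate normal density from the squared Mahalanobis distance:
    # pdf(x) = exp(-d2 / 2) / sqrt((2*pi)^d * det(sigma))
    d = X.shape[1]
    d2 = mahalanobis2(X, mu, sigma)
    return np.exp(-0.5 * d2) / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))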

The density of a multivariate normal distribution is a function of the squared Mahalanobis distance of the point to the mean. Scipy has an implementation of this as scipy.stats.multivariate_normal.pdf to use as a reference. I ran all three methods against each other 10x using identical random parameters each time, with d=3, k=96, n=5e5. Here are the results, in points/sec:

[Method]: (min/max/avg)
Scipy:                      1.18e5/1.29e5/1.22e5
Fancy 1:                    1.41e5/1.53e5/1.48e5
Fancy 2:                    8.69e4/9.73e4/9.03e4
Fancy 2 (cheating version): 8.61e4/9.88e4/9.04e4

where Fancy 1 is the better of the two methods above and Fancy 2 is Dougal’s 2nd solution. Since Fancy 2 needs to calculate the inverses of all the covariance matrices, I also tried a “cheating version” where it was passed these as a parameter, but it looks like that didn’t make a difference. I had planned on including the non-vectorized implementation, but that was so slow it would have taken all day.

What we can take away from this is that using Dougal’s first method is about 20% faster than however Scipy does it. Unfortunately despite its cleverness the 2nd method is only about 60% as fast as the first. There are probably some other optimizations that can be done but this is already fast enough for me.

I also tested how this scaled with higher dimensionality. With d=100, k=96, n=1e4:

Scipy:                      7.81e3/7.91e3/7.86e3
Fancy 1:                    1.03e4/1.15e4/1.08e4
Fancy 2:                    3.75e3/4.10e3/3.95e3
Fancy 2 (cheating version): 3.58e3/4.09e3/3.85e3

Fancy 1 seems to have an even bigger advantage this time. It’s also worth noting that Scipy threw a LinAlgError 8/10 times, probably because some of my randomly-generated 100×100 covariance matrices were close to singular (which may mean that the other two methods are not as numerically stable; I did not actually check the results).

Answered By: JaredL

For 3-D arrays:

# Vectorized Mahalanobis distance between corresponding rows of two
# 3-D arrays x and y (defined elsewhere), each of shape (i, j, k)
import numpy as np

i, j, k = x.shape
xx = x.reshape(i, j * k).T   # (j*k, i): each row is one i-dimensional point
yy = y.reshape(i, j * k).T

# Estimate the covariance (and its inverse) from both point sets together
X = np.vstack([xx, yy])
V = np.cov(X.T)
VI = np.linalg.inv(V)

# use einsum for the row-wise distances
delta = xx - yy
D = np.sqrt(np.einsum('nj,jk,nk->n', delta, VI, delta))
print(D)

# check against the naive matrix formulation (builds the full matrix)
print(np.diag(np.sqrt(np.dot(np.dot(xx - yy, VI), (xx - yy).T))))

# check against scipy's cdist (take the diagonal for the row-wise distances)
from scipy.spatial.distance import cdist
results = cdist(xx, yy, 'mahalanobis')
results = np.diag(results)
print(results)
Answered By: JeeyCi