Computing the KL divergence between two estimated Gaussian KDEs

Question:

I have two datasets with the same features and would like to estimate the "distance between the distributions" of the two datasets. My idea was to fit a Gaussian KDE to each dataset and compute the KL divergence between the estimated KDEs. However, I am struggling to compute this "distance" between the distributions. This is what I have so far:

import numpy as np
from scipy import stats
from scipy.stats import entropy


dataset1 = np.random.rand(50)
dataset2 = np.random.rand(49)

kernel1 = stats.gaussian_kde(dataset1)
kernel2 = stats.gaussian_kde(dataset2)

I know I can use entropy(pk, qk) to calculate the KL divergence, but I don't understand how to do that starting from the kernels. I thought about generating some random points and using entropy(kernel1.pdf(points), kernel2.pdf(points)), but the pdf function outputs some weird numbers (sometimes higher than 1; does that mean it assigns more than 100% probability?), and I am not sure the output I get is correct.

If anyone knows how to calculate the distance between the two Gaussian KDE kernels, I would be very thankful.

Asked By: f.leno


Answers:

There is no closed-form solution for the KL divergence between two mixtures of Gaussians (and a Gaussian KDE is exactly a mixture of Gaussians).

KL(p || q) := E_p[ log(p(x) / q(x)) ]

so you can use a Monte Carlo (MC) estimator:

def KL_mc(p, q, n=100):
    # Monte Carlo estimate of E_p[log(p(x)/q(x))]:
    # sample n points from p and average the log density ratio.
    points = p.resample(n)
    p_pdf = p.pdf(points)
    q_pdf = q.pdf(points)
    return np.log(p_pdf / q_pdf).mean()

Note:

  • you might need to add some clipping to avoid 0s and infinities (a sketch with clipping follows these notes)
  • depending on the dimensionality of the space this can require quite large n
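
For instance, here is a minimal sketch of the estimator with clipping added; the function name KL_mc_clipped and the eps value are my own choices for illustration, not anything prescribed by scipy:

import numpy as np
from scipy import stats

def KL_mc_clipped(p, q, n=1000, eps=1e-300):
    # Sample from p and clip the densities away from zero so the log ratio
    # never becomes infinite when a point falls far into one tail of q.
    points = p.resample(n)
    p_pdf = np.clip(p.pdf(points), eps, None)
    q_pdf = np.clip(q.pdf(points), eps, None)
    return np.log(p_pdf / q_pdf).mean()

kernel1 = stats.gaussian_kde(np.random.rand(50))
kernel2 = stats.gaussian_kde(np.random.rand(49))
print(KL_mc_clipped(kernel1, kernel2))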

(sometimes higher than 1; does that mean it assigns more than 100% probability?)

A PDF is not a probability, not for continuous distributions. It is a probability density, a relative measure. The probability assigned to any single value is always 0; the probability of sampling an element from a given set/interval equals the integral of the pdf over that set. Pointwise the pdf can therefore take values greater than 1, as long as it only does so over a "small enough" set, so the integral still never exceeds 1.
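
A quick way to see this with scipy's KDE (the 0.01 scaling below is just an arbitrary way to get a tightly concentrated sample):

import numpy as np
from scipy import stats

# Tightly concentrated 1D data gives a tall, narrow density.
data = 0.01 * np.random.rand(50)
kde = stats.gaussian_kde(data)

print(kde.pdf(0.005))                          # density value, typically well above 1
print(kde.integrate_box_1d(-np.inf, np.inf))   # total probability mass: 1.0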

More general solution

Overall, unless you really need KL for theoretical reasons, there are divergences that are better suited to Gaussian mixtures (e.g. ones that have closed-form solutions), for example the Cauchy-Schwarz divergence.

In particular you can look at Maximum Entropy Linear Manifold, which is based exactly on computing CS divergences between KDEs of points. You can see a Python implementation in melm/dcsk.py, in the value(v) function, on GitHub. In your case you do not want a projection, so just set v to the identity matrix.

    def value(self, v):
        # We need matrix, not vector
        v = v.reshape(-1, self.k)

        ipx0 = self._ipx(self.x0, self.x0, v)
        ipx1 = self._ipx(self.x1, self.x1, v)
        ipx2 = self._ipx(self.x0, self.x1, v)

        return np.log(ipx0) + np.log(ipx1) - 2 * np.log(ipx2)

    def _f1(self, X0, X1, v):
        Hxy = self.gamma * self.gamma * self._H(X0, X1)
        vHv = v.T.dot(Hxy).dot(v)
        return 1.0 / (X0.shape[0] * X1.shape[0] * np.sqrt(la.det(vHv)) * (2 * np.pi) ** (self.k / 2))

    def _f2(self, X0, X1, v):
        Hxy = self.gamma * self.gamma * self._H(X0, X1)
        vHv = v.T.dot(Hxy).dot(v)
        vHv_inv = la.inv(vHv)

        vx0 = X0.dot(v)
        vx1 = X1.dot(v)
        vx0c = vx0.dot(vHv_inv)
        vx1c = vx1.dot(vHv_inv)

        ret = 0.0
        for i in range(X0.shape[0]):
            ret += np.exp(-0.5 * ((vx0c[i] - vx1c) * (vx0[i] - vx1)).sum(axis=1)).sum()
        return ret

    def _ipx(self, X0, X1, v):
        return self._f1(X0, X1, v) * self._f2(X0, X1, v)

The main difference between CS and KL is that KL requires you to compute the integral of the logarithm of a pdf, while CS computes the logarithm of an integral. It happens that with Gaussian mixtures it is the integration of the logarithm that is the problem; without the logarithm everything is easy, and thus D_CS is preferable.
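
To make the "without the logarithm everything is easy" part concrete, here is a minimal sketch of the CS divergence for two 1D scipy Gaussian KDEs. This is not the melm code; cs_divergence_1d is my own illustrative helper, relying on the fact that the integral of a product of two Gaussian densities is again a Gaussian density evaluated at the difference of their means:

import numpy as np
from scipy import stats
from scipy.stats import norm

def cs_divergence_1d(p, q):
    # Closed-form integral of the product of two 1D Gaussian KDEs:
    # int N(x; a, s1^2) N(x; b, s2^2) dx = N(a - b; 0, s1^2 + s2^2)
    def cross(k1, k2):
        diffs = k1.dataset.ravel()[:, None] - k2.dataset.ravel()[None, :]
        var = k1.covariance[0, 0] + k2.covariance[0, 0]
        return norm.pdf(diffs, scale=np.sqrt(var)).mean()
    pp, qq, pq = cross(p, p), cross(q, q), cross(p, q)
    # D_CS(p, q) = -log( int p*q / sqrt(int p^2 * int q^2) )
    return -np.log(pq / np.sqrt(pp * qq))

kernel1 = stats.gaussian_kde(np.random.rand(50))
kernel2 = stats.gaussian_kde(np.random.rand(49))
print(cs_divergence_1d(kernel1, kernel2))

Unlike the MC estimator above, this needs no sampling and no clipping, which is exactly the advantage the closed form buys you.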

Answered By: lejlot