3 functions for computing relative entropy in scipy. What's the difference?

Question:

SciPy offers the following Python functions that seem to compute the same information-theoretic measure, Kullback-Leibler (KL) divergence, which is also called relative entropy:

  • scipy.stats.entropy, which computes the KL divergence when a second distribution qk is passed (with qk=None it returns the plain Shannon entropy instead)
  • scipy.special.rel_entr
  • scipy.special.kl_div

Why three of the same thing? Could someone explain the difference between them?

Asked By: develarist


Answers:

The default option for computing KL-divergence between discrete probability vectors would be scipy.stats.entropy.

In contrast, both scipy.special.rel_entr and scipy.special.kl_div are "element-wise functions" that can be used in conjunction with the usual array operations, and have to be summed before they yield the aggregate relative entropy value.

While both result in the same sum (when used with proper probability vectors whose elements sum to 1), the second variant (scipy.special.kl_div) differs element-wise in that it adds the extra terms -x + y, i.e.,

(x log(x/y)) – x + y

which cancel out in the sum.

For example

from numpy import array
from scipy.stats import entropy
from scipy.special import rel_entr, kl_div

p = array([1/2, 1/2])
q = array([1/10, 9/10])

print(entropy(p, q))                        # aggregate KL divergence
print(rel_entr(p, q), sum(rel_entr(p, q)))  # element-wise terms and their sum
print(kl_div(p, q), sum(kl_div(p, q)))      # element-wise terms (with -x + y added) and their sum

yields

0.5108256237659907
[ 0.80471896 -0.29389333] 0.5108256237659907
[0.40471896 0.10610667] 0.5108256237659906

I am not familiar with the rationale behind the element-wise extra terms of scipy.special.kl_div, but the documentation points to a reference that might explain more.

See:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.kl_div.html#scipy.special.kl_div

Answered By: Mario Boley

While Mario Boley’s accepted answer illustrates the difference with a good example, it does not explain why the term -x + y is added.

The expression x log(x/y) can be positive or negative, depending on the values of x and y. In particular, it is negative when y > x.

Since KL divergence is used as a distance measure, it can be convenient for every element-wise term to be non-negative, not just the total. For example, if you are using several such distance metrics and want to compute their average as an overall metric, all terms need to be non-negative.

Adding y - x to x log(x/y) makes it non-negative when y > x (see the convexity argument below). When x > y, the term is already positive. However, when summing over all pairs (x_i, y_i) to compute the KL divergence, adding y - x to the "negative terms" has to be compensated by adding the same correction to the "positive terms" as well; because the x_i and the y_i each sum to 1, these extra terms cancel in the total.
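As a quick numerical illustration (a minimal sketch using random probability vectors, not taken from the example above), every kl_div term is non-negative, some rel_entr terms are negative, and both sums still agree with scipy.stats.entropy:

import numpy as np
from scipy.stats import entropy
from scipy.special import rel_entr, kl_div

rng = np.random.default_rng(0)   # fixed seed, purely illustrative
p = rng.random(5)
p /= p.sum()                     # normalize to a probability vector
q = rng.random(5)
q /= q.sum()

print("rel_entr terms:", rel_entr(p, q))  # some terms are negative (wherever q_i > p_i)
print("kl_div terms:  ", kl_div(p, q))    # every term is non-negative
print("sums agree:", np.isclose(rel_entr(p, q).sum(), kl_div(p, q).sum()),
      np.isclose(entropy(p, q), kl_div(p, q).sum()))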

Now, how do you know that f = x log(x/y) - x + y >= 0 for any x, y > 0? Compute the Hessian: for x > 0, y > 0 its determinant is zero and its trace is positive, so its eigenvalues are non-negative and f is convex. Compute the gradient and equate it to zero: it vanishes exactly when x = y, where f = 0. A convex function attains its global minimum at a critical point, hence f >= 0 for all (x, y).
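If you want to verify this symbolically, a minimal sympy sketch along the following lines (assuming sympy is available) reproduces the argument:

import sympy as sp

x, y = sp.symbols('x y', positive=True)
f = x * sp.log(x / y) - x + y                         # the element-wise kl_div term

grad = [sp.simplify(sp.diff(f, v)) for v in (x, y)]   # [log(x/y), 1 - x/y]
hess = sp.hessian(f, [x, y])                          # [[1/x, -1/y], [-1/y, x/y**2]]

print(grad)                                           # both components vanish when x == y
print(sp.simplify(f.subs(y, x)))                      # f(x, x) == 0
print(sp.simplify(hess.det()), sp.simplify(hess.trace()))  # 0 and 1/x + x/y**2: non-negative eigenvalues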

https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

Answered By: SANDEEP PALAKKAL