Relative Entropy, Kullback-Leibler Divergence

Buildup

Given two probability distributions $P$ and $Q$, it is easy to imagine situations in which one wonders how different the two are.

For instance, consider guessing which digit appears in a camera image. The digits 6 and 9 are easy to confuse when there is no clear indication of which way is up, and since some people write 9 with a line underneath while others do not, handwriting data gathered from many people makes it hard to decide between the two with confidence.

Now suppose we want to build an artificial intelligence that distinguishes 6 from 9 in this situation. For this classifier to learn well, that is, to mimic humans as closely as possible, it should reproduce the probability distribution implied by the human-labeled data, ambiguity and all, as faithfully as it can. This motivation naturally leads to 'the difference between probability distributions', and finding techniques to minimize that difference brings us one step closer to the AI we want.

Meanwhile, the two probability distributions in the motivation above do not need to stand on equal footing; in fact, treating them symmetrically could even be a problem. When our interest lies in the (reference) probability distribution $P$, how $Q$ differs from $P$ may not matter much in itself, but it would be troublesome if the two were so different as to be incomparable. For instance, if $P$ were a binomial distribution and $Q$ an exponential distribution, there would be an insurmountable gap between the two; no matter how much it was narrowed, they would remain fundamentally different. Thus, at least in this context, it is better to treat $P$ as a fixed reference.

Now let us try to imagine a concrete function for comparing the two. As with a distance function, the most naive approach, in the spirit of functional analysis, would be to simply take the difference of the probability density functions $p,q$ of the two distributions. $$ D(p,q) \overset{?}{=} \sup | p(x) - q(x) | $$ The problem, as pointed out in the previous paragraph, is that this definition is completely indifferent to the reference distribution $P$. With such a definition alone, it remains unclear whether the distributions differ in any essential way or merely in their parameters. At this point, let us consider Shannon entropy $H$.

Shannon entropy: $$ H(P) := - \sum p(x) \log_{2} p(x) $$

In terms of expectation, this can be read as $E \left( - \log_{2} p(X) \right)$, the average of the function $- \log_{2} p(x)$ with respect to the probability distribution $p(x)$. From the discussion so far, if entropy is used well, it seems possible to define a function that expresses the difference between the two distributions while still paying attention to the reference distribution $P$. Let's also recall cross entropy.

Cross entropy: $$ H (P,Q) := - \sum p(x) \log_{2} q(x) $$
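As a quick numerical aside, here is a minimal Python sketch of the two quantities above, assuming NumPy; the three-outcome pmfs `p` and `q` are made-up values, and the Monte Carlo average is only there to illustrate the expectation reading $E \left( - \log_{2} p(X) \right)$ of the entropy.

```python
import numpy as np

rng = np.random.default_rng(42)

p = np.array([0.5, 0.3, 0.2])        # made-up pmf over outcomes 0, 1, 2
entropy = -np.sum(p * np.log2(p))    # H(P) from the definition, in bits

# H(P) as an expectation: average -log2 p(X) over samples X ~ P
samples = rng.choice(len(p), size=200_000, p=p)
monte_carlo = np.mean(-np.log2(p[samples]))

print(entropy, monte_carlo)          # the sample average approaches H(P)

# Cross entropy H(P, Q) for a second, made-up pmf Q
q = np.array([0.4, 0.4, 0.2])
print(-np.sum(p * np.log2(q)))
```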

Without overthinking it, if we simply throw together all of the ingredients mentioned above, we can imagine the following function $D_{\text{KL}} \left( P , Q \right)$.

$$ \begin{align*} & \left| E_{P} \left[ \log_{2} q(X) - \log_{2} p(X) \right] \right| \\ =& \left| - \sum p(x) \log_{2} q(x) + \sum p(x) \log_{2} p(x) \right| \\ =& \left| H \left( P, Q \right) - H \left( P \right) \right| \\ \overset{?}{=}& D_{\text{KL}} \left( P , Q \right) \end{align*} $$

Definition

Given the probability mass functions $p,q$ or probability density functions $f,g$ of two probability distributions $P, Q$, the quantity $D_{\text{KL}} \left( P \| Q \right)$ defined below is called the Kullback-Leibler Divergence or Relative Entropy from $Q$ to $P$.

$$ \begin{align*} D_{\text{KL}} \left( P \| Q \right) :=& H(P,Q) - H(P) \\ \overset{\text{or}}{=}& - \sum p(x) \log_{2} {{ q(x) } \over { p(x) }} \\ \overset{\text{or}}{=}& - \int_{\mathbb{R}} f(x) \log_{2} {{ g(x) } \over { f(x) }} dx \end{align*} $$
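To make the definition concrete, here is a minimal Python sketch, assuming NumPy: one helper evaluates the summation form for pmfs, and another approximates the integral form on a grid with a plain Riemann sum. The pmf values and the choice of two normal densities are made-up examples, not anything prescribed by the definition.

```python
import numpy as np

def kl_discrete(p, q):
    """D_KL(P || Q) = -sum p(x) log2( q(x)/p(x) ), for pmfs with p, q > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log2(q / p))

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), used only to build a continuous example."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def kl_continuous(f, g, grid):
    """Riemann-sum approximation of -∫ f(x) log2( g(x)/f(x) ) dx on a grid."""
    fx, gx = f(grid), g(grid)
    dx = grid[1] - grid[0]
    return -np.sum(fx * np.log2(gx / fx)) * dx

# Discrete example: two made-up pmfs over three outcomes
print(kl_discrete([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))

# Continuous example: the densities of N(0, 1) and N(1, 1)
x = np.linspace(-10, 10, 10001)
print(kl_continuous(lambda t: normal_pdf(t, 0, 1), lambda t: normal_pdf(t, 1, 1), x))
```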

Explanation

$$ D_{\text{KL}} \left( P \| Q \right) = - \sum p_{k} \log_{2} {{ q_{k} } \over { p_{k} }} $$

Grasping the meaning of the Kullback-Leibler divergence from the expression above alone is not easy, even for mathematics majors. The key to accepting the formula is the logarithm $\log_{2}$. $$ \begin{align*} D_{\text{KL}} \left( P \| Q \right) =& - \sum p_{k} \log_{2} {{ q_{k} } \over { p_{k} }} \\ =& - \sum p_{k} \log_{2} q_{k} - \left( - \sum p_{k} \log_{2} p_{k} \right) \\ =& H ( P, Q ) - H ( P ) \end{align*} $$ As mentioned in the buildup, the reason a simple 'difference' turns into a fraction is the logarithm. Splitting the fraction ${{ q } \over { p }}$ inside the logarithm produces a subtraction, and the meaning starts to come into view. Of course, this alone is not enough to understand it fully; one must also study Shannon entropy and cross entropy properly.
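The decomposition can also be checked numerically; the following sketch, assuming NumPy and two made-up pmfs, computes the fraction form and $H(P,Q) - H(P)$ separately and confirms they agree.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])  # reference pmf P (made-up values)
q = np.array([0.4, 0.4, 0.2])  # compared pmf Q (made-up values)

kl = -np.sum(p * np.log2(q / p))           # fraction form of D_KL(P || Q)
cross_entropy = -np.sum(p * np.log2(q))    # H(P, Q)
shannon_entropy = -np.sum(p * np.log2(p))  # H(P)

# The two routes give the same number: D_KL(P || Q) = H(P, Q) - H(P)
print(kl, cross_entropy - shannon_entropy)
```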

A Simple Measure to Represent Divergence

By Gibbs' inequality, the cross entropy is always greater than or equal to the standalone entropy, so $D_{\text{KL}} \ge 0$ always holds. Meanwhile, when the two distributions are exactly the same, that is, $p=q$, then $q/p = 1$, so $$ D_{\text{KL}} \left( P \| Q \right) = - \sum p(x) \log_{2} {{ q(x) } \over { p(x) }} = - \sum p(x) \cdot 0 = 0 $$ and, as common sense dictates, $D_{\text{KL}} = 0$. Intuitively, it therefore makes sense to read the Kullback-Leibler divergence itself as a measure of how different $Q$ is from $P$.
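Both properties are easy to check numerically. The sketch below, assuming NumPy, draws a few strictly positive pmfs at random (the Dirichlet draws are just a convenient way to generate them) and confirms $D_{\text{KL}} \ge 0$, then evaluates the divergence of a made-up pmf against itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_divergence(p, q):
    """D_KL(P || Q) in bits, assuming strictly positive pmfs."""
    return -np.sum(p * np.log2(q / p))

# Gibbs' inequality in action: D_KL >= 0 for randomly drawn pmf pairs
for _ in range(5):
    p = rng.dirichlet(np.ones(4))  # random strictly positive pmf on 4 outcomes
    q = rng.dirichlet(np.ones(4))
    assert kl_divergence(p, q) >= 0

# And D_KL = 0 when the two distributions coincide
p = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, p))  # 0.0
```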

Not a Distance Function

$$ D_{\text{KL}} \left( P \| Q \right) \ne D_{\text{KL}} \left( Q \| P \right) $$

In general, the symmetry above does not hold. This is because the expectation is taken with respect to the reference distribution $P$; if the reference is changed to $Q$, the Kullback-Leibler divergence can change arbitrarily, since from the perspective of $P$, $Q$ may look utterly foreign, while from the perspective of $Q$, $P$ may look quite similar. This relativity is why the Kullback-Leibler divergence cannot serve as a distance function, and it is also why it is called Relative Entropy.
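A small numerical example, assuming NumPy and two made-up pmfs, makes the asymmetry visible.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) in bits, assuming strictly positive pmfs."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log2(q / p))

# Made-up pmfs: swapping the roles of P and Q changes the value
p = [0.8, 0.1, 0.1]
q = [0.5, 0.4, 0.1]

print(kl_divergence(p, q))  # D_KL(P || Q) ≈ 0.34 bits
print(kl_divergence(q, p))  # D_KL(Q || P) ≈ 0.46 bits -- not the same
```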

Machine Learning

In machine learning, cross entropy is more widely known than the Kullback-Leibler divergence. In fact, in machine learning, as opposed to information theory, there is conceptually little difference between the two, and strictly speaking the Kullback-Leibler divergence might even be the more appropriate function; cross entropy is preferred in actual computations simply because it is more convenient.

Why use cross entropy instead of Kullback-Leibler divergence: … When expressed in terms of the standalone entropy $H(p)$, $$ H(p,q) = H(p) + D_{\text{KL}} \left( p \| q \right) $$ and since $D_{\text{KL}} \left( p \| q \right) = 0$ when $p=q$, it can be used as an objective function (loss function). … Since $H(p)$ is the standalone (reference) entropy, it does not change during the computation … However, from a computer's point of view, $$ \begin{align*} H (p,q) =& - \sum p(x) \log_{2} q(x) \\ D_{\text{KL}} \left( p \| q \right) =& - \sum p(x) \log_{2} {{ q(x) } \over { p(x) }} \end{align*} $$ the obviously easier one to compute is cross entropy. …
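The point of the quotation can be reproduced in a few lines. The sketch below, assuming NumPy and made-up distributions, shows that cross entropy and Kullback-Leibler divergence always differ by the constant $H(p)$, so they rank candidate models identically while the former is cheaper to evaluate.

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])   # fixed "true" distribution (made-up labels)
candidates = {
    "q1": np.array([0.5, 0.3, 0.2]),
    "q2": np.array([0.2, 0.5, 0.3]),
}

h_p = -np.sum(p * np.log2(p))   # H(p): constant with respect to the model q

for name, q in candidates.items():
    cross = -np.sum(p * np.log2(q))      # H(p, q)
    kl = -np.sum(p * np.log2(q / p))     # D_KL(p || q)
    print(name, cross, kl, cross - kl)   # cross - kl always equals H(p)

# Since the two losses differ only by the constant H(p), they rank the
# candidate models identically; cross entropy is simply cheaper to evaluate.
```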

Therefore, it is advisable to study the concepts through the Kullback-Leibler divergence but to use cross entropy in practice. Of course, for a theorist, whether one truly understands how the two coincide and how they differ makes a significant difference.