
Cross Entropy

Overview

Cross entropy is the average number of bits required to encode events from one probability distribution using a code optimized for another; it is commonly defined between the assumed true (reference) probability distribution $p$ and the estimated (expected) probability distribution $q$.

Definition 1

Discrete

Let $p, q$ be the probability mass functions of two discrete probability distributions. The cross entropy $H (p,q)$ of the two distributions is defined as follows: $$ H (p,q) := - \sum p(x) \log_{2} q(x) $$
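For the discrete case, the sum above can be evaluated directly. Below is a minimal Python sketch (the distributions are made up purely for illustration), using the base-2 logarithm from the definition.

```python
import numpy as np

def cross_entropy(p, q):
    """Discrete cross entropy H(p, q) = -sum p(x) log2 q(x), in bits."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Terms with p(x) = 0 contribute nothing; mask them to avoid log2(0).
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

p = np.array([0.5, 0.25, 0.25])   # reference distribution
q = np.array([0.25, 0.25, 0.5])   # estimated distribution

print(cross_entropy(p, p))  # equals the entropy H(p) = 1.5 bits
print(cross_entropy(p, q))  # >= H(p); here 1.75 bits
```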

Continuous

Let $( \Omega , \mathcal{F}, m)$ be a measure space with measure $m$. If $f, g$ are the probability density functions of two probability measures $P, Q$ that are absolutely continuous with respect to $m$, then the cross entropy $H (P,Q)$, or $H(f,g)$, of the two distributions is defined as: $$ H (P,Q) = H(f,g) = - \int f(x) \log_{2} g(x) dm(x) $$
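As a concrete sketch of the continuous case, one can take $m$ to be the Lebesgue measure on the real line and choose $f, g$ to be normal densities (an illustrative choice, not required by the definition); the integral is then approximated numerically with SciPy over a wide finite interval.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Take m to be Lebesgue measure on the real line; f, g are normal densities
# (an illustrative choice of P and Q).
f = norm(loc=0.0, scale=1.0).pdf   # density of P
g = norm(loc=1.0, scale=2.0).pdf   # density of Q

# H(P, Q) = -∫ f(x) log2 g(x) dx, approximated over a wide finite interval.
H_PQ, _ = quad(lambda x: -f(x) * np.log2(g(x)), -20, 20)
print(H_PQ)  # cross entropy of P relative to Q, in bits
```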

Explanation

One does not need to fret too much over the complexity of the definition of continuous cross entropy.

Machine Learning

Since 2010, cross entropy has been actively used as a loss function when training machine learning models, especially classifiers. Written in terms of the Kullback-Leibler divergence $D_{\text{KL}} \left( p \| q \right)$ and the self-entropy $H(p)$ of the reference distribution, $$ H(p,q) = H(p) + D_{\text{KL}} \left( p \| q \right) $$ it is used as an objective function (loss function) whose minimization drives $D_{\text{KL}} \left( p \| q \right)$ toward $0$, that is, toward $p = q$. In a classification problem, $p$ represents the probabilities derived from the actual data (for example, a label indicating whether the subject in a photo is a dog or a cat), while $q$ represents the output of the classifier (say, probabilities of 70% for a dog, 20% for a cat, 5% for an airplane, and so on). When the two nearly coincide, that is, $p \approx q$, the classifier predicts the label almost exactly and the cross entropy is minimized.
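To make the classification setting concrete, here is a minimal sketch of cross entropy as a per-example loss; the one-hot label and the predicted probabilities simply echo the dog/cat example above and are otherwise made up.

```python
import numpy as np

def cross_entropy_loss(p_true, q_pred, eps=1e-12):
    """Cross entropy between a one-hot label p and predicted probabilities q.
    A small eps guards against log(0); the base-2 log matches the definition
    above (frameworks usually use the natural log, which only rescales it)."""
    q_pred = np.clip(q_pred, eps, 1.0)
    return -np.sum(p_true * np.log2(q_pred))

# Label: the photo actually shows a dog (classes: dog, cat, airplane, other).
p = np.array([1.0, 0.0, 0.0, 0.0])

# Classifier output: 70% dog, 20% cat, 5% airplane, 5% other.
q = np.array([0.70, 0.20, 0.05, 0.05])

print(cross_entropy_loss(p, q))   # about 0.515 bits
print(cross_entropy_loss(p, p))   # 0: perfect prediction, p = q
```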

For readers familiar with mathematics and keen on details, one might wonder why only $D_{\text{KL}} \left( p \| q \right)$ seems to matter in the objective function. Since the reference entropy $H(p)$ does not change during training, it may seem more natural to use $D_{\text{KL}} \left( p \| q \right)$ itself as the loss function. From the computer's perspective, however, comparing $$ \begin{align*} H (p,q) &= - \sum p(x) \log_{2} q(x) \\ D_{\text{KL}} \left( p \| q \right) &= - \sum p(x) \log_{2} {{ q(x) } \over { p(x) }} \end{align*} $$ it is clear that cross entropy is easier to compute. Since the optimization (training) outcome is the same either way, there is no need to go through the trouble of computing $q(x) / p(x)$. In other words, it is not that we avoid $D_{\text{KL}} \left( p \| q \right)$ merely because $H(p)$ does not change; rather, we use cross entropy because it is cheaper to compute. The conclusion is that, in the context of loss functions, cross entropy and Kullback-Leibler divergence are conceptually identical.
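As a small sanity check of the decomposition $H(p,q) = H(p) + D_{\text{KL}} \left( p \| q \right)$, the following sketch (with the same made-up distributions as before) computes the three quantities directly; since $H(p)$ does not depend on $q$, minimizing $H(p,q)$ over $q$ and minimizing $D_{\text{KL}} \left( p \| q \right)$ over $q$ amount to the same thing.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])   # reference distribution (fixed)
q = np.array([0.25, 0.25, 0.5])   # estimated distribution

H_p  = -np.sum(p * np.log2(p))        # self-entropy H(p)
H_pq = -np.sum(p * np.log2(q))        # cross entropy H(p, q)
D_kl = -np.sum(p * np.log2(q / p))    # Kullback-Leibler divergence

print(H_pq, H_p + D_kl)   # both 1.75: H(p, q) = H(p) + D_KL(p || q)
```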

Of course, this reasoning concerns numerical computation; in papers dealing with, say, matrix decomposition, the Kullback-Leibler divergence may still be used for the sake of convenience in the mathematical development.

Notation

Note that the notation $H(X,Y)$ can overlap with that of joint entropy. In practice, cross entropy itself comes up mostly in the machine-learning context described above, so the two are rarely confused, but one should keep in mind that the notations coincide.