
Conditional Entropy

Definition 1

A joint probability mass function $p$ or a joint probability density function $f$ is given for the random variables $X_{1}, \cdots , X_{n}$. The conditional entropy of $X_{1}, \cdots , X_{n}$ given $X_{k}$ is written $H \left( X_{1}, \cdots , X_{n} | X_{k} \right)$ and defined as follows.

Discrete

$$ H \left( X_{1}, \cdots , X_{n} | X_{k} \right) := - \sum_{x_{1}} \cdots \sum_{x_{n}} p \left( x_{1} , \cdots , x_{n} \right) \log_{2} \frac{ p \left( x_{1} , \cdots , x_{n} \right) }{ p(x_{k}) } $$

Continuous

$$ H \left( X_{1}, \cdots , X_{n} | X_{k} \right) := - \int_{\mathbb{R}} \cdots \int_{\mathbb{R}} f \left( x_{1} , \cdots , x_{n} \right) \log_{2} \frac{ f \left( x_{1} , \cdots , x_{n} \right) }{ f(x_{k}) } \, dx_{1} \cdots dx_{n} $$


  • In the notation $H \left( X_{1}, \cdots , X_{n} | X_{k} \right)$, the list $X_{1}, \cdots , X_{n}$ to the left of the bar is understood to exclude $X_{k}$, even though the notation does not make this explicit. In the sums and integrals above, however, $x_{1} , \cdots , x_{n}$ does include $x_{k}$.
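
To see the discrete definition in action, here is a minimal Python sketch. It is not from the text: the function name `conditional_entropy` and the choice to store the joint pmf as an $n$-dimensional NumPy array are assumptions of the example. It evaluates $H \left( X_{1}, \cdots , X_{n} | X_{k} \right)$ directly from the sum in the definition.

```python
import numpy as np

def conditional_entropy(joint, k):
    """H(X_1, ..., X_n | X_k) in bits, computed straight from the definition.

    joint : n-dimensional array with joint[x1, ..., xn] = p(x1, ..., xn),
            entries nonnegative and summing to 1.
    k     : 0-based axis of the conditioning variable X_k.
    """
    joint = np.asarray(joint, dtype=float)
    # Marginal p(x_k): sum out every axis except k, keeping dims for broadcasting.
    other_axes = tuple(i for i in range(joint.ndim) if i != k)
    marginal = joint.sum(axis=other_axes, keepdims=True)
    # -sum p(x_1, ..., x_n) * log2( p(x_1, ..., x_n) / p(x_k) ) over the support;
    # zero-probability cells are skipped, using the convention 0 log 0 = 0.
    ratio = np.broadcast_to(marginal, joint.shape)
    mask = joint > 0
    h = -np.sum(joint[mask] * np.log2(joint[mask] / ratio[mask]))
    return h + 0.0  # normalize a floating-point -0.0 to 0.0

# Example: X fair on {0, 1} and Y = X, so knowing X leaves no uncertainty in Y.
p = np.array([[0.5, 0.0],
              [0.0, 0.5]])
print(conditional_entropy(p, k=0))  # 0.0
```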

Theorem

  • [1] For two random variables $X, Y$, the following holds: $$ H(X,Y) = H(X) + H \left( Y | X \right) $$ In particular, if $X$ and $Y$ are independent, $$ H \left( X | Y \right) = H(X) \\ H \left( Y | X \right) = H(Y) $$
  • [2] Chain Rule: $$ \begin{align*} H \left( X_{1}, \cdots , X_{n} \right) =& \sum_{k=1}^{n} H \left( X_{k} | X_{1} , \cdots , X_{k-1} \right) \\ =& H \left( X_{1} \right) + H \left( X_{2} | X_{1} \right) + H \left( X_{3} | X_{1}, X_{2} \right) + \cdots \\ & + H \left( X_{n} | X_{1} , \cdots , X_{n-1} \right) \end{align*} $$ Both identities are checked numerically in the sketch after this list.
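
As a sanity check on [1] and [2], the following sketch verifies both identities on randomly generated joint pmfs. It assumes the hypothetical `conditional_entropy` from the definition above is in scope; `entropy` is an ad-hoc helper of my own, not from the text.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a pmf stored as an array of any shape."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 log 0 = 0 by convention
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)

# [1] H(X, Y) = H(X) + H(Y | X) on a random 3x4 joint pmf p(x, y).
joint = rng.random((3, 4))
joint /= joint.sum()
p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)
print(np.isclose(entropy(joint),
                 entropy(p_x) + conditional_entropy(joint, k=0)))  # True

# Independent case p(x, y) = p(x) p(y): H(Y | X) = H(Y) and H(X | Y) = H(X).
indep = np.outer(p_x, p_y)
print(np.isclose(conditional_entropy(indep, k=0), entropy(p_y)))  # True
print(np.isclose(conditional_entropy(indep, k=1), entropy(p_x)))  # True

# [2] Chain rule for three variables:
# H(X1, X2, X3) = H(X1) + H(X2 | X1) + H(X3 | X1, X2).
j3 = rng.random((2, 3, 2))
j3 /= j3.sum()
h1 = entropy(j3.sum(axis=(1, 2)))              # H(X1)
h2 = conditional_entropy(j3.sum(axis=2), k=0)  # H(X2 | X1)
h3 = entropy(j3) - entropy(j3.sum(axis=2))     # H(X3 | X1, X2), via [1]
print(np.isclose(entropy(j3), h1 + h2 + h3))   # True
```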

Explanation

Simply put, it is the entropy that remains from the joint entropy once an additional condition is given. Reading the formula intuitively, $$ H \left( Y | X \right) = H(X,Y) - H(X) $$ can be seen as the uncertainty left over from the original disorder $H(X,Y)$ after the information of $X$ has resolved its share. The chain rule is a generalization of this.
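
For a concrete check, let $X$ be a fair coin and $Y = X$. Then $H(X,Y) = 1$ bit and $H(X) = 1$ bit, so $H \left( Y | X \right) = 1 - 1 = 0$: once $X$ is known, no uncertainty about $Y$ remains. If instead $Y$ is a second, independent fair coin, $H(X,Y) = 2$ bits and $H \left( Y | X \right) = 2 - 1 = 1 = H(Y)$, as theorem [1] predicts.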


  1. Applebaum. (2008). Probability and Information (2nd Edition): p. 236. ↩︎