Conditional Entropy
Definition 1
Given a joint probability mass function $p$ or a joint probability density function $f$ for the random variables $X_{1}, \cdots , X_{n}$, the conditional entropy of $X_{1}, \cdots , X_{n}$ given $X_{k}$ is denoted $H \left( X_{1}, \cdots , X_{n} | X_{k} \right)$ and defined as follows.
Discrete
$$ H \left( X_{1}, \cdots , X_{n} | X_{k} \right) := - \sum_{x_{1}} \cdots \sum_{x_{n}} p \left( x_{1} , \cdots , x_{n} \right) \log_{2} {{ p \left( x_{1} , \cdots , x_{n} \right) } \over { p(x_{k}) }} $$
Continuous
$$ H \left( X_{1}, \cdots , X_{n} | X_{k} \right) := - \int_{\mathbb{R}} \cdots \int_{\mathbb{R}} f \left( x_{1} , \cdots , x_{n} \right) \log_{2} {{ f \left( x_{1} , \cdots , x_{n} \right) } \over { f(x_{k}) }} d x_{1} \cdots d x_{n} $$
- Strictly speaking, the list $X_{1}, \cdots , X_{n}$ in front of the conditioning bar should not include $X_{k}$; the notation above is written loosely for convenience. The summation (integration) variables $x_{1}, \cdots , x_{n}$, on the other hand, do include $x_{k}$.
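To make the discrete definition concrete, here is a minimal Python sketch (not from the text; the $2 \times 2$ joint pmf is a hypothetical example) that computes $H \left( Y | X \right)$ directly from the defining sum $- \sum_{x} \sum_{y} p(x,y) \log_{2} {{ p(x,y) } \over { p(x) }}$.

```python
import numpy as np

# Hypothetical joint pmf p(x, y) for two binary random variables X, Y.
# Rows index x, columns index y; all cells are positive and sum to 1.
p_xy = np.array([[0.30, 0.20],
                 [0.10, 0.40]])

p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x), shape (2, 1)

# H(Y | X) = - sum_{x,y} p(x,y) * log2( p(x,y) / p(x) )
# (cells with p(x,y) = 0 would contribute 0 and should be skipped in general)
H_Y_given_X = -np.sum(p_xy * np.log2(p_xy / p_x))

print(f"H(Y|X) = {H_Y_given_X:.4f} bits")   # about 0.85 bits for this pmf
```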
Theorem
- [1] For two random variables $X,Y$, the following holds: $$ H(X,Y) = H(X) + H \left( Y | X \right) $$ In particular, if $X$ and $Y$ are independent, then $$ H \left( X | Y \right) = H(X) \\ H \left( Y | X \right) = H(Y) $$
- [2] Chain Rule: $$ \begin{align*} H \left( X_{1}, \cdots , X_{n} \right) =& \sum_{k=1}^{n} H \left( X_{k} | X_{1} , \cdots , X_{k-1} \right) \\ =& H \left( X_{1} \right) + H \left( X_{2} | X_{1} \right) + H \left( X_{3} | X_{1}, X_{2} \right) + \cdots \\ & + H \left( X_{n} | X_{1} , \cdots , X_{n-1} \right) \end{align*} $$ Both identities are checked numerically in the sketch below.
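The following Python sketch verifies [1] and the chain rule for a hypothetical joint pmf over three binary variables; the helper `H_cond` is an assumption of this sketch (not a library routine) and evaluates conditional entropy straight from the defining sum.

```python
import numpy as np

def H(p):
    """Shannon entropy (in bits) of a pmf given as an array of probabilities."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]                                       # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

def H_cond(p_joint, cond_axes):
    """Conditional entropy of the remaining variables given those on cond_axes,
    computed from the defining sum -sum p(x) log2( p(x) / p(x_cond) )."""
    other = tuple(a for a in range(p_joint.ndim) if a not in cond_axes)
    p_cond = p_joint.sum(axis=other, keepdims=True)    # marginal of the condition
    mask = p_joint > 0
    return -np.sum(p_joint[mask] * np.log2((p_joint / p_cond)[mask]))

# Hypothetical joint pmf p(x, y, z) over three binary variables.
rng = np.random.default_rng(0)
p_xyz = rng.random((2, 2, 2))
p_xyz /= p_xyz.sum()                                   # normalize to a valid pmf

p_xy = p_xyz.sum(axis=2)                               # marginal p(x, y)
p_x  = p_xyz.sum(axis=(1, 2))                          # marginal p(x)

# [1]  H(X, Y) = H(X) + H(Y | X)
assert np.isclose(H(p_xy), H(p_x) + H_cond(p_xy, cond_axes=(0,)))

# [1]  Independence: if p(x, y) = p(x) p(y), then H(Y | X) = H(Y)
q_x, q_y = np.array([0.3, 0.7]), np.array([0.25, 0.75])
assert np.isclose(H_cond(np.outer(q_x, q_y), cond_axes=(0,)), H(q_y))

# [2]  Chain rule: H(X, Y, Z) = H(X) + H(Y | X) + H(Z | X, Y)
lhs = H(p_xyz)
rhs = H(p_x) + H_cond(p_xy, cond_axes=(0,)) + H_cond(p_xyz, cond_axes=(0, 1))
assert np.isclose(lhs, rhs)
print("decomposition [1] and chain rule [2] hold numerically")
```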
Explanation
Simply put, conditional entropy is the joint entropy with an additional condition imposed. Reading the formula intuitively, $$ H \left( Y | X \right) = H(X,Y) - H(X) $$ is the uncertainty that remains after the information in $X$ has resolved part of the original disorder $H(X,Y)$. The chain rule is a generalization of this.
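To fill in the step behind this reading, here is a short derivation of [1] in the discrete case, using only the definition above and $\sum_{y} p(x,y) = p(x)$: $$ \begin{align*} H \left( Y | X \right) =& - \sum_{x} \sum_{y} p(x,y) \log_{2} {{ p(x,y) } \over { p(x) }} \\ =& - \sum_{x} \sum_{y} p(x,y) \log_{2} p(x,y) + \sum_{x} \sum_{y} p(x,y) \log_{2} p(x) \\ =& H(X,Y) + \sum_{x} p(x) \log_{2} p(x) \\ =& H(X,Y) - H(X) \end{align*} $$ The continuous case is analogous, with integrals in place of sums.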
Applebaum. (2008). Probability and Information (2nd Edition): p236. ↩︎