
Conditional Entropy

Definition 1

A joint probability mass function $p$ or a joint probability density function $f$ is given for the random variables $X_{1}, \cdots , X_{n}$. The conditional entropy of $X_{1}, \cdots , X_{n}$ given $X_{k}$ is written $H \left( X_{1}, \cdots , X_{n} | X_{k} \right)$ and defined as follows.

Discrete

$$ H \left( X_{1}, \cdots , X_{n} | X_{k} \right) := - \sum_{x_{1}} \cdots \sum_{x_{n}} p \left( x_{1} , \cdots , x_{n} \right) \log_{2} \frac{ p \left( x_{1} , \cdots , x_{n} \right) }{ p(x_{k}) } $$

Continuous

$$ H \left( X_{1}, \cdots , X_{n} | X_{k} \right) := - \int_{\mathbb{R}} \cdots \int_{\mathbb{R}} f \left( x_{1} , \cdots , x_{n} \right) \log_{2} \frac{ f \left( x_{1} , \cdots , x_{n} \right) }{ f(x_{k}) } \, dx_{1} \cdots dx_{n} $$


  • In the notation $H \left( X_{1}, \cdots , X_{n} | X_{k} \right)$, the list $X_{1}, \cdots , X_{n}$ to the left of the bar is understood to exclude $X_{k}$, even though the notation does not make this explicit. In the sums and integrals above, however, $x_{1} , \cdots , x_{n}$ does include $x_{k}$.
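
To see the discrete definition in action, here is a minimal Python sketch. It is not from the text: the function name `conditional_entropy` and the choice to store the joint pmf as an $n$-dimensional NumPy array are assumptions of the example. It evaluates $H \left( X_{1}, \cdots , X_{n} | X_{k} \right)$ directly from the sum in the definition.

```python
import numpy as np

def conditional_entropy(joint, k):
    """H(X_1, ..., X_n | X_k) in bits, computed straight from the definition.

    joint : n-dimensional array with joint[x1, ..., xn] = p(x1, ..., xn),
            entries nonnegative and summing to 1.
    k     : 0-based axis of the conditioning variable X_k.
    """
    joint = np.asarray(joint, dtype=float)
    # Marginal p(x_k): sum out every axis except k, keeping dims for broadcasting.
    other_axes = tuple(i for i in range(joint.ndim) if i != k)
    marginal = joint.sum(axis=other_axes, keepdims=True)
    # -sum p(x_1, ..., x_n) * log2( p(x_1, ..., x_n) / p(x_k) ) over the support;
    # zero-probability cells are skipped, using the convention 0 log 0 = 0.
    ratio = np.broadcast_to(marginal, joint.shape)
    mask = joint > 0
    h = -np.sum(joint[mask] * np.log2(joint[mask] / ratio[mask]))
    return h + 0.0  # normalize a floating-point -0.0 to 0.0

# Example: X fair on {0, 1} and Y = X, so knowing X leaves no uncertainty in Y.
p = np.array([[0.5, 0.0],
              [0.0, 0.5]])
print(conditional_entropy(p, k=0))  # 0.0
```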

Theorem

  • [1] For two random variables $X, Y$, the following holds: $$ H(X,Y) = H(X) + H \left( Y | X \right) $$ In particular, if $X$ and $Y$ are independent, $$ H \left( X | Y \right) = H(X) \\ H \left( Y | X \right) = H(Y) $$
  • [2] Chain Rule: $$ \begin{align*} H \left( X_{1}, \cdots , X_{n} \right) =& \sum_{k=1}^{n} H \left( X_{k} | X_{1} , \cdots , X_{k-1} \right) \\ =& H \left( X_{1} \right) + H \left( X_{2} | X_{1} \right) + H \left( X_{3} | X_{1}, X_{2} \right) + \cdots \\ & + H \left( X_{n} | X_{1} , \cdots , X_{n-1} \right) \end{align*} $$ Both identities are checked numerically in the sketch after this list.
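
As a sanity check on [1] and [2], the following sketch verifies both identities on randomly generated joint pmfs. It assumes the hypothetical `conditional_entropy` from the definition above is in scope; `entropy` is an ad-hoc helper of my own, not from the text.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a pmf stored as an array of any shape."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 log 0 = 0 by convention
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)

# [1] H(X, Y) = H(X) + H(Y | X) on a random 3x4 joint pmf p(x, y).
joint = rng.random((3, 4))
joint /= joint.sum()
p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)
print(np.isclose(entropy(joint),
                 entropy(p_x) + conditional_entropy(joint, k=0)))  # True

# Independent case p(x, y) = p(x) p(y): H(Y | X) = H(Y) and H(X | Y) = H(X).
indep = np.outer(p_x, p_y)
print(np.isclose(conditional_entropy(indep, k=0), entropy(p_y)))  # True
print(np.isclose(conditional_entropy(indep, k=1), entropy(p_x)))  # True

# [2] Chain rule for three variables:
# H(X1, X2, X3) = H(X1) + H(X2 | X1) + H(X3 | X1, X2).
j3 = rng.random((2, 3, 2))
j3 /= j3.sum()
h1 = entropy(j3.sum(axis=(1, 2)))              # H(X1)
h2 = conditional_entropy(j3.sum(axis=2), k=0)  # H(X2 | X1)
h3 = entropy(j3) - entropy(j3.sum(axis=2))     # H(X3 | X1, X2), via [1]
print(np.isclose(entropy(j3), h1 + h2 + h3))   # True
```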

Explanation

Simply put, it is the entropy that remains from the joint entropy once an additional condition is given. Reading the formula intuitively, $$ H \left( Y | X \right) = H(X,Y) - H(X) $$ can be seen as the uncertainty left over from the original disorder $H(X,Y)$ after the information of $X$ has resolved its share. The chain rule is a generalization of this.
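
For a concrete check, let $X$ be a fair coin and $Y = X$. Then $H(X,Y) = 1$ bit and $H(X) = 1$ bit, so $H \left( Y | X \right) = 1 - 1 = 0$: once $X$ is known, no uncertainty about $Y$ remains. If instead $Y$ is a second, independent fair coin, $H(X,Y) = 2$ bits and $H \left( Y | X \right) = 2 - 1 = 1 = H(Y)$, as theorem [1] predicts.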


  1. Applebaum. (2008). Probability and Information (2nd Edition): p. 236. ↩︎