What is Conditional Entropy in Classical Information Theory?
Buildup
For a discrete random variable X and its probability mass function p(x), the entropy H(X) of X is defined as follows.
$$ H(X) = - \sum_{i} p(x_{i}) \log_{2} p(x_{i}) $$
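As a quick illustration of this formula, here is a minimal Python sketch (the coin distributions are hypothetical examples, not from the text): a fair coin attains the maximum entropy of 1 bit, and a biased coin has less.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(X) in bits for a pmf given as an array of probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # the convention 0 * log 0 = 0
    return -(p * np.log2(p)).sum()

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin: ≈ 0.469 bits
```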
Similarly, for the joint probability mass function p(x,y) of X and Y, the joint entropy H(X,Y) is defined as follows.
$$ H(X,Y) = - \sum_{i,j} p(x_{i}, y_{j}) \log_{2} p(x_{i}, y_{j}) $$
Then, for the conditional probability p(x_i | y_j) when Y = y_j, H(X | Y = y_j) can be defined as follows.
$$ H(X \mid Y = y_{j}) = - \sum_{i} p(x_{i} \mid y_{j}) \log_{2} p(x_{i} \mid y_{j}) \tag{1} $$
Naturally, H(X | Y) can be defined as the expectation of H(X | Y = y_j) over all y_j. Let's call this the conditional entropy.
$$ \begin{aligned} H(X \mid Y) &:= \sum_{j} p(y_{j}) H(X \mid Y = y_{j}) \\ &= - \sum_{i,j} p(y_{j}) p(x_{i} \mid y_{j}) \log_{2} p(x_{i} \mid y_{j}) \\ &= - \sum_{i,j} p(x_{i}, y_{j}) \log_{2} p(x_{i} \mid y_{j}) \end{aligned} \tag{2} $$
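To make this buildup concrete, the following Python sketch uses a small hypothetical joint pmf (not from the text) and checks that the expectation over y_j agrees with the joint-weighted sum in the last line of (2).

```python
import numpy as np

# Hypothetical joint pmf p(x_i, y_j): rows index x_i, columns index y_j.
p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])

p_y = p_xy.sum(axis=0)          # marginal p(y_j)
p_x_given_y = p_xy / p_y        # conditional p(x_i | y_j), column-wise

# H(X | Y = y_j) for each y_j, following (1)
H_X_given_yj = -(p_x_given_y * np.log2(p_x_given_y)).sum(axis=0)

# Expectation over y_j, following the first form of (2)
H_X_given_Y = (p_y * H_X_given_yj).sum()

# Joint-weighted form, following the last form of (2)
H_X_given_Y_joint = -(p_xy * np.log2(p_x_given_y)).sum()

print(H_X_given_yj)                     # entropy of X for each fixed y_j
print(H_X_given_Y, H_X_given_Y_joint)   # both ≈ 0.7145 bits, they agree
```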
Definition
For discrete random variables X and Y, the conditional entropy is defined as follows.
$$ H(X \mid Y) = - \sum_{i,j} p(x_{i}, y_{j}) \log_{2} p(x_{i} \mid y_{j}) \tag{3} $$
If X and Y are continuous random variables,
$$ H(X \mid Y) = - \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} p(x,y) \log_{2} p(x \mid y) \, dx \, dy $$
Description
When you first encounter the definition (3) of conditional entropy, it may be hard to see why the weight is p(x,y) rather than p(x|y). The formula (3) is easier to accept if you regard it as the expectation of (1) over all values of Y. In other words, conditional entropy is the expectation, taken with respect to the joint distribution, of the information −log2 p(x|y) of the conditional probability.
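To see why the weight must be p(x,y), the short sketch below (using the same kind of hypothetical joint pmf as above) compares the correct weighting with a naive one: weighting the information −log2 p(x|y) by p(x|y) alone just adds up H(X | Y = y_j) over all y_j with equal weight, because those weights do not form a single probability distribution.

```python
import numpy as np

# Hypothetical joint pmf p(x_i, y_j): rows index x_i, columns index y_j.
p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])
p_y = p_xy.sum(axis=0)
p_x_given_y = p_xy / p_y
surprisal = -np.log2(p_x_given_y)        # the information -log2 p(x_i | y_j)

correct = (p_xy * surprisal).sum()       # expectation under the joint: H(X | Y)
naive = (p_x_given_y * surprisal).sum()  # weights sum to 2, not 1 -> not an expectation

print(correct)  # ≈ 0.7145 bits
print(naive)    # = sum_j H(X | Y = y_j) ≈ 1.4613, overweights rare y_j
```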
Properties
$$ H(X \mid Y) = H(X,Y) - H(Y) $$
It can be derived directly from (2).
$$ \begin{aligned} H(X \mid Y) &= - \sum_{i,j} p(y_{j}) p(x_{i} \mid y_{j}) \log_{2} p(x_{i} \mid y_{j}) \\ &= - \sum_{i,j} p(x_{i}, y_{j}) \log_{2} \frac{p(x_{i}, y_{j})}{p(y_{j})} \\ &= - \sum_{i,j} p(x_{i}, y_{j}) \log_{2} p(x_{i}, y_{j}) + \sum_{i,j} p(x_{i}, y_{j}) \log_{2} p(y_{j}) \\ &= H(X,Y) + \sum_{j} \Big( \sum_{i} p(x_{i}, y_{j}) \Big) \log_{2} p(y_{j}) \\ &= H(X,Y) + \sum_{j} p(y_{j}) \log_{2} p(y_{j}) \\ &= H(X,Y) - H(Y) \end{aligned} $$
Rearranging gives H(X,Y) = H(X | Y) + H(Y). Since entropy is built from the logarithm of probabilities, it is natural to read this as the product in p(x,y) = p(x|y) p(y) turning into a sum.
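This identity can also be checked numerically. The sketch below, again with a small hypothetical joint pmf, computes H(X,Y), H(Y), and H(X | Y) separately and confirms that H(X | Y) = H(X,Y) − H(Y).

```python
import numpy as np

# Hypothetical joint pmf p(x_i, y_j): rows index x_i, columns index y_j.
p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])
p_y = p_xy.sum(axis=0)                             # marginal p(y_j)

H_XY = -(p_xy * np.log2(p_xy)).sum()               # joint entropy H(X, Y)
H_Y = -(p_y * np.log2(p_y)).sum()                  # marginal entropy H(Y)
H_X_given_Y = -(p_xy * np.log2(p_xy / p_y)).sum()  # conditional entropy H(X | Y), by (3)

print(H_XY - H_Y)     # ≈ 0.7145 bits
print(H_X_given_Y)    # ≈ 0.7145 bits, matching H(X,Y) - H(Y)
```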