Conditional Expectation Minimizes the Sum of Squared Deviations
Summary
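For random variables $X$ and $Y$ with $E[Y^2] < \infty$, the following hold, where the minimum is taken over functions $f(X)$ of $X$: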
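$$
E[Y \mid X] = \operatorname*{argmin}_{f(X)} E\left[ (Y - f(X))^2 \mid X \right] \tag{1}
$$
$$
E[Y \mid X] = \operatorname*{argmin}_{f(X)} E\left[ (Y - f(X))^2 \right] \tag{2}
$$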
Proof
(1)
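$$
\begin{aligned}
\operatorname*{argmin}_{f(X)} E\left[ (Y - f(X))^2 \mid X \right]
&= \operatorname*{argmin}_{f(X)} E\left[ Y^2 - 2Yf(X) + f(X)^2 \mid X \right] \\
&= \operatorname*{argmin}_{f(X)} \left( E[Y^2 \mid X] - 2E[Yf(X) \mid X] + E[f(X)^2 \mid X] \right) \\
&= \operatorname*{argmin}_{f(X)} \left( -2E[Yf(X) \mid X] + E[f(X)^2 \mid X] \right)
\end{aligned}
$$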
The last equality holds because the first term, $E[Y^2 \mid X]$, does not depend on $f(X)$ and so does not affect the minimizer.
The smoothing property of conditional expectation (Pulling out known factors)
[a] $E[f(X)Y \mid X] = f(X)E[Y \mid X]$
By the properties of conditional expectation, the equation can be rearranged as follows.
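$$
\begin{aligned}
\operatorname*{argmin}_{f(X)} \left( -2E[Yf(X) \mid X] + E[f(X)^2 \mid X] \right)
&\overset{[a]}{=} \operatorname*{argmin}_{f(X)} \left( -2f(X)E[Y \mid X] + f(X)^2 E[1 \mid X] \right) \\
&= \operatorname*{argmin}_{f(X)} \left( -2f(X)E[Y \mid X] + f(X)^2 \right)
\end{aligned}
$$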
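Differentiating the objective with respect to $f(X)$ and setting the derivative to zero, we get
$$
\frac{d}{d f(X)} \left( -2f(X)E[Y \mid X] + f(X)^2 \right) = -2E[Y \mid X] + 2f(X) = 0
\implies f(X) = E[Y \mid X]
$$
The objective is a quadratic in $f(X)$ with a positive leading coefficient, so this stationary point is the minimum.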
■
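As a numerical sanity check of (1), the following sketch minimizes the conditional squared loss $E[(Y - c)^2 \mid X = x]$ over a grid of candidate values $c = f(x)$ and compares the minimizer with $E[Y \mid X = x]$. The joint distribution `PMF`, the grid, and all helper names are hypothetical and chosen only for illustration; exact rational arithmetic is used so that no rounding is involved.

```python
from fractions import Fraction as F

# Hypothetical joint pmf P(X = x, Y = y), chosen only for illustration.
PMF = {
    (0, 0): F(1, 10), (0, 1): F(2, 10), (0, 2): F(1, 10),
    (1, 0): F(3, 10), (1, 1): F(1, 10), (1, 2): F(2, 10),
}

def cond_mean(x):
    """E[Y | X = x], computed directly from the joint pmf."""
    px = sum(p for (xi, _), p in PMF.items() if xi == x)
    return sum(y * p for (xi, y), p in PMF.items() if xi == x) / px

def cond_sq_loss(x, c):
    """E[(Y - c)^2 | X = x] for a candidate value c = f(x)."""
    px = sum(p for (xi, _), p in PMF.items() if xi == x)
    return sum((y - c) ** 2 * p for (xi, y), p in PMF.items() if xi == x) / px

def grid_argmin(x, step=F(1, 60)):
    """Minimize the conditional loss over a grid of candidates in [0, 2]."""
    grid = [k * step for k in range(int(2 / step) + 1)]
    return min(grid, key=lambda c: cond_sq_loss(x, c))

for x in (0, 1):
    print(x, cond_mean(x), grid_argmin(x))
```

The grid step is picked so that both conditional means lie on the grid; with a coarser grid the search would only return the nearest grid point.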
(2)
Subtract and add $E[Y \mid X]$ inside the square, then group the result into two terms.
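$$
\begin{aligned}
\operatorname*{argmin}_{f(X)} E\left[ (Y - f(X))^2 \right]
&= \operatorname*{argmin}_{f(X)} E\left[ (Y - E[Y \mid X] + E[Y \mid X] - f(X))^2 \right] \\
&= \operatorname*{argmin}_{f(X)} E\left[ \left( (Y - E[Y \mid X]) + (E[Y \mid X] - f(X)) \right)^2 \right]
\end{aligned}
$$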
Expanding the square with $(a+b)^2 = a^2 + b^2 + 2ab$ and using the linearity of expectation, we get
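$$
\begin{aligned}
\operatorname*{argmin}_{f(X)} E\left[ (Y - f(X))^2 \right]
&= \operatorname*{argmin}_{f(X)} E\left[ \left( (Y - E[Y \mid X]) + (E[Y \mid X] - f(X)) \right)^2 \right] \\
&= \operatorname*{argmin}_{f(X)} \Big( E\left[ (Y - E[Y \mid X])^2 \right] + E\left[ (E[Y \mid X] - f(X))^2 \right] \\
&\qquad\qquad\quad + 2E\left[ (Y - E[Y \mid X])(E[Y \mid X] - f(X)) \right] \Big)
\end{aligned}
$$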
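The first term does not depend on $f(X)$, so it can be dropped without changing the minimizer.
$$
\operatorname*{argmin}_{f(X)} E\left[ (Y - f(X))^2 \right]
= \operatorname*{argmin}_{f(X)} \left( E\left[ (E[Y \mid X] - f(X))^2 \right] + 2E\left[ (Y - E[Y \mid X])(E[Y \mid X] - f(X)) \right] \right) \tag{3}
$$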
Let’s compute the second term.
The smoothing property of conditional expectation (Pulling out known factors)
[a] $E[f(X)Y \mid X] = f(X)E[Y \mid X]$
The property of conditional expectation (Law of total expectation)
[b] $E_X\left[ E_Y[Y \mid X] \right] = E_Y[Y]$
Applying the above properties to the second term of (3), we get
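$$
\begin{aligned}
E\left[ (Y - E[Y \mid X])(E[Y \mid X] - f(X)) \right]
&\overset{[b]}{=} E\left[ E\left[ (Y - E[Y \mid X])(E[Y \mid X] - f(X)) \mid X \right] \right] \\
&\overset{[a]}{=} E\left[ (E[Y \mid X] - f(X)) \, E\left[ Y - E[Y \mid X] \mid X \right] \right] \\
&\overset{\text{linearity}}{=} E\left[ (E[Y \mid X] - f(X)) \left( E[Y \mid X] - E\left[ E[Y \mid X] \mid X \right] \right) \right]
\end{aligned}
$$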
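In the last expression, $E[Y \mid X]$ is itself a function of $X$, so $E\left[ E[Y \mid X] \mid X \right] \overset{[a]}{=} E[Y \mid X] \, E[1 \mid X] = E[Y \mid X]$, and the last factor vanishes:
$$
\begin{aligned}
E\left[ (Y - E[Y \mid X])(E[Y \mid X] - f(X)) \right]
&= E\left[ (E[Y \mid X] - f(X))(E[Y \mid X] - E[Y \mid X]) \right] \\
&= E\left[ (E[Y \mid X] - f(X)) \cdot 0 \right] \\
&= 0
\end{aligned}
$$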
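The vanishing of this cross term for every choice of $f$ can be checked exactly on a small discrete example. The sketch below assumes a hypothetical joint pmf and a few arbitrary candidate predictors; none of these names or values come from the derivation itself.

```python
from fractions import Fraction as F

# Hypothetical joint pmf P(X = x, Y = y), chosen only for illustration.
PMF = {
    (0, 0): F(1, 10), (0, 1): F(2, 10), (0, 2): F(1, 10),
    (1, 0): F(3, 10), (1, 1): F(1, 10), (1, 2): F(2, 10),
}

def expect(g):
    """E[g(X, Y)] under the joint pmf."""
    return sum(g(x, y) * p for (x, y), p in PMF.items())

def cond_mean(x):
    """E[Y | X = x], computed directly from the joint pmf."""
    px = sum(p for (xi, _), p in PMF.items() if xi == x)
    return sum(y * p for (xi, y), p in PMF.items() if xi == x) / px

def cross_term(f):
    """E[(Y - E[Y|X]) (E[Y|X] - f(X))] for a candidate predictor f."""
    return expect(lambda x, y: (y - cond_mean(x)) * (cond_mean(x) - f(x)))

# Arbitrary candidate predictors f(X); any function of x could be used here.
candidates = [
    lambda x: 0,
    lambda x: x,
    lambda x: 3 - 2 * x,
    lambda x: F(7, 3) * x * x,
]
cross_terms = [cross_term(f) for f in candidates]
print(cross_terms)
```

Every entry of `cross_terms` is exactly zero, because for each fixed $x$ the residual $Y - E[Y \mid X = x]$ averages to zero under the conditional distribution of $Y$.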
Therefore, the second term of (3) is 0, and the quantity to be minimized reduces to
$$
\operatorname*{argmin}_{f(X)} E\left[ (E[Y \mid X] - f(X))^2 \right]
$$
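Since $(E[Y \mid X] - f(X))^2 \ge 0$, the expectation $E\left[ (E[Y \mid X] - f(X))^2 \right]$ is nonnegative and attains its minimum value 0 at $f(X) = E[Y \mid X]$. Therefore,
$$
\operatorname*{argmin}_{f(X)} E\left[ (Y - f(X))^2 \right]
= \operatorname*{argmin}_{f(X)} E\left[ (E[Y \mid X] - f(X))^2 \right]
= E[Y \mid X]
$$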
■
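The proof of (2) amounts to the decomposition $E[(Y - f(X))^2] = E[(Y - E[Y \mid X])^2] + E[(E[Y \mid X] - f(X))^2]$, which the following sketch verifies exactly for a few predictors. The joint pmf and the candidate functions are hypothetical, introduced only for this illustration.

```python
from fractions import Fraction as F

# Hypothetical joint pmf P(X = x, Y = y), chosen only for illustration.
PMF = {
    (0, 0): F(1, 10), (0, 1): F(2, 10), (0, 2): F(1, 10),
    (1, 0): F(3, 10), (1, 1): F(1, 10), (1, 2): F(2, 10),
}

def expect(g):
    """E[g(X, Y)] under the joint pmf."""
    return sum(g(x, y) * p for (x, y), p in PMF.items())

def cond_mean(x):
    """E[Y | X = x], computed directly from the joint pmf."""
    px = sum(p for (xi, _), p in PMF.items() if xi == x)
    return sum(y * p for (xi, y), p in PMF.items() if xi == x) / px

def mse(f):
    """E[(Y - f(X))^2], the unconditional squared loss of predictor f."""
    return expect(lambda x, y: (y - f(x)) ** 2)

def gap(f):
    """E[(E[Y|X] - f(X))^2], the only part of the loss that depends on f."""
    return expect(lambda x, y: (cond_mean(x) - f(x)) ** 2)

# Arbitrary candidate predictors f(X), each differing from E[Y|X].
candidates = {
    "zero": lambda x: 0,
    "identity": lambda x: x,
    "affine": lambda x: 3 - 2 * x,
}

best = mse(cond_mean)  # loss attained by the conditional expectation itself
for name, f in candidates.items():
    # The two printed losses agree, and both exceed `best`.
    print(name, mse(f), best + gap(f))
```

Because `gap(f)` is a mean of squares, it is zero only when `f` agrees with `cond_mean` on every value of `x` that has positive probability, which is why no other predictor can match `best`.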