Conditional Expectation Minimizes the Sum of Squared Deviations

Summary

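Let $X$ and $Y$ be random variables with $E\left[ Y^{2} \right] < \infty$. Among all functions $f(X)$ of $X$, the conditional expectation minimizes both the conditional and the unconditional expected squared deviation: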

\begin{equation} E\left[ Y | X \right] = \argmin_{f(X)} E\left[ (Y - f(X))^{2} | X \right] \end{equation}

\begin{equation} E\left[ Y | X \right] = \argmin_{f(X)} E\left[ (Y - f(X))^{2} \right] \end{equation}
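As a quick numerical illustration of the second statement, the sketch below compares the empirical mean squared error of several predictors on simulated data. Everything in it is an assumption made for illustration: the model $Y = X^{2} + \varepsilon$, the discrete support $\{0, 1, 2\}$ of $X$, and the helper names `conditional_mean_table` and `mse`. The within-group sample mean stands in for $E[Y | X]$.

```python
import random
from collections import defaultdict

def conditional_mean_table(xs, ys):
    """Return {x: mean of the y values observed together with x}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in zip(xs, ys):
        sums[x] += y
        counts[x] += 1
    return {x: sums[x] / counts[x] for x in sums}

def mse(f, xs, ys):
    """Empirical mean of (y - f(x))^2."""
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: X uniform on {0, 1, 2}, Y = X^2 + standard normal noise.
rng = random.Random(0)
xs = [rng.choice([0, 1, 2]) for _ in range(5000)]
ys = [x * x + rng.gauss(0.0, 1.0) for x in xs]

table = conditional_mean_table(xs, ys)
overall_mean = sum(ys) / len(ys)

candidates = {
    "conditional mean": lambda x: table[x],
    "overall mean": lambda x: overall_mean,
    "identity": lambda x: float(x),
    "shifted truth": lambda x: x * x + 0.5,
}
scores = {name: mse(f, xs, ys) for name, f in candidates.items()}
for name, score in scores.items():
    print(f"{name:>16}: {score:.4f}")
```

Because $X$ takes only finitely many values here, the within-group mean minimizes the empirical squared error over every function of $X$, so the first score should come out smallest whichever competitors are listed.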

Proof

(1)

\begin{align*} & \argmin_{f(X)} E\left[ (Y - f(X))^{2} | X \right] \\ &= \argmin_{f(X)} E\left[ Y^{2} - 2Yf(X) + f(X)^{2} | X \right] \\ &= \argmin_{f(X)} \left( E\left[ Y^{2} | X \right] - 2E\left[ Yf(X) | X \right] + E\left[ f(X)^{2} | X \right] \right) \\ &= \argmin_{f(X)} \left( - 2E\left[ Yf(X) | X \right] + E\left[ f(X)^{2} | X \right] \right) \\ \end{align*}

The last equality holds because the first term, $E\left[ Y^{2} | X \right]$, does not depend on $f(X)$.

Property of conditional expectation (pulling out known factors)

[a] $E[f(X)Y | X] = f(X) E[Y | X]$

By property [a], applied to both terms, the expression can be rewritten as follows.

\begin{align*} & \argmin_{f(X)} \left( - 2E\left[ Yf(X) | X \right] + E\left[ f(X)^{2} | X \right] \right) \\ &\overset{[a]}{=} \argmin_{f(X)} \left( - 2f(X) E\left[ Y| X \right] + f(X)^{2}E\left[ 1 | X \right] \right) \\ &= \argmin_{f(X)} \left( - 2f(X) E\left[ Y| X \right] + f(X)^{2} \right) \\ \end{align*}

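For each fixed value of $X$, this is a quadratic in $f(X)$ with a positive leading coefficient, so setting its derivative with respect to $f(X)$ to zero gives the minimizer: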

$$ \dfrac{d (- 2f(X) E\left[ Y| X \right] + f(X)^{2})}{d f(X)} = -2E\left[ Y | X \right] + 2f(X) = 0 $$

$$ \implies f(X) = E[Y | X] $$
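The same conclusion can be reached without calculus by completing the square. As shorthand introduced only for this remark, write $c = f(X)$ and $m = E\left[ Y | X \right]$ for a fixed value of $X$:

\begin{align*} - 2cm + c^{2} &= (c - m)^{2} - m^{2} \\ &\ge - m^{2} \end{align*}

Equality holds exactly when $c = m$, that is, when $f(X) = E\left[ Y | X \right]$, which confirms that the stationary point is a minimum and not a maximum.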

(2)

Subtract and add $E\left[ Y | X \right]$ inside the square, then group the result into two parenthesized terms.

\begin{align*} \argmin_{f(X)} E\left[ (Y - f(X))^{2} \right] &= \argmin_{f(X)} E\left[ (Y - E\left[ Y | X \right] + E\left[ Y | X \right] - f(X))^{2} \right] \\ &= \argmin_{f(X)} E\left[ \Big( (Y - E\left[ Y | X \right]) + (E\left[ Y | X \right] - f(X)) \Big)^{2} \right] \end{align*}

Expanding the square with the identity $(a + b)^{2} = a^{2} + b^{2} + 2ab$ and applying the linearity of expectation, we get

\begin{align*} & \argmin_{f(X)} E\left[ (Y - f(X))^{2} \right] \\ &= \argmin_{f(X)} E\left[ \Big( (Y - E\left[ Y | X \right]) + (E\left[ Y | X \right] - f(X)) \Big)^{2} \right] \\ &= \argmin_{f(X)} \Big( E\left[ (Y - E\left[ Y | X \right])^{2} \right] + E\left[ (E\left[ Y | X \right] - f(X))^{2} \right] \\ &\qquad\qquad\qquad + 2 E\Big[ (Y - E\left[ Y | X \right])(E\left[ Y | X \right] - f(X)) \Big] \Big) \end{align*}

The first term does not depend on $f(X)$, so it can be dropped from the minimization.

\begin{align} & \argmin_{f(X)} E\left[ (Y - f(X))^{2} \right] \nonumber \\ &= \argmin_{f(X)} \Big( E\left[ (E\left[ Y | X \right] - f(X))^{2} \right] + 2 E\Big[ (Y - E\left[ Y | X \right])(E\left[ Y | X \right] - f(X)) \Big] \Big) \end{align}

Let’s compute the second term.

Property of conditional expectation (pulling out known factors)

[a] $E[f(X)Y | X] = f(X) E[Y | X]$

Property of conditional expectation (law of total expectation)

[b] $E_{X}\left[ E_{Y}\left[ Y | X \right] \right] = E_{Y}\left[ Y \right]$

Applying the above properties to the second term of $(3)$, we get

\begin{align*} & E\Big[ (Y - E\left[ Y | X \right])(E\left[ Y | X \right] - f(X)) \Big] \\ &\overset{[b]}{=} E \left[ E\Big[ (Y - E\left[ Y | X \right])(E\left[ Y | X \right] - f(X)) \Big| X \Big] \right] \\ &\overset{[a]}{=} E \left[ (E\left[ Y | X \right] - f(X)) E\Big[ (Y - E\left[ Y | X \right]) \Big| X \Big] \right] \\ &\overset{\text{linearity}}{=} E \left[ (E\left[ Y | X \right] - f(X)) \left( E[ Y | X ] - E \left[ E\left[ Y | X \right] \Big| X \right] \right) \right] \end{align*}

Since $E\left[ Y | X \right]$ is itself a function of $X$, the last term in the final line is $E \left[ E\left[ Y | X \right] \Big| X \right] \overset{[a]}{=} E\left[ Y | X \right] E \left[ 1 | X \right] = E\left[ Y | X \right]$, and therefore

\begin{align*} & E\Big[ (Y - E\left[ Y | X \right])(E\left[ Y | X \right] - f(X)) \Big] \\ &= E \Big[ (E\left[ Y | X \right] - f(X)) \left( E[ Y | X ] - E\left[ Y | X \right] \right) \Big] \\ &= E \Big[ (E\left[ Y | X \right] - f(X)) \cdot 0 \Big] \\ &= 0 \end{align*}
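The vanishing cross term has an exact sample analogue that is easy to check. In the sketch below, the data-generating model, the competitor functions, and the helper name `cross_term` are all hypothetical; the within-group mean `m[x]` plays the role of $E[Y | X = x]$. Within each group the residuals $y - m[x]$ sum to zero, so the empirical cross term should vanish up to floating-point rounding for any choice of $f$.

```python
import random
from collections import defaultdict

# Hypothetical data: X uniform on {0, 1, 2, 3}, Y = 2X + bounded noise.
rng = random.Random(1)
xs = [rng.randint(0, 3) for _ in range(2000)]
ys = [2.0 * x + rng.uniform(-1.0, 1.0) for x in xs]

# Empirical stand-in for E[Y | X = x]: the mean of y within each x-group.
groups = defaultdict(list)
for x, y in zip(xs, ys):
    groups[x].append(y)
m = {x: sum(v) / len(v) for x, v in groups.items()}

def cross_term(f):
    """Empirical version of E[(Y - E[Y|X]) (E[Y|X] - f(X))]."""
    return sum((y - m[x]) * (m[x] - f(x)) for x, y in zip(xs, ys)) / len(xs)

competitors = [
    ("constant 0", lambda x: 0.0),
    ("x squared", lambda x: float(x * x)),
    ("3x - 1", lambda x: 3.0 * x - 1.0),
]
for name, f in competitors:
    # Each value should be zero up to rounding error.
    print(name, cross_term(f))
```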

Therefore, the second term of $(3)$ is $0$, and the minimization reduces to the following.

$$ \argmin_{f(X)} E\Big[ (E\left[ Y | X \right] - f(X))^{2} \Big] $$

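The expectation of a square is nonnegative, and it attains the value $0$ when $f(X) = E\left[ Y | X \right]$. Therefore,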

$$ \argmin_{f(X)} E\left[ (Y - f(X))^{2} \right] = \argmin_{f(X)} E\Big[ (E\left[ Y | X \right] - f(X))^{2} \Big] = E\left[ Y | X \right] $$
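The proof of (2) amounts to the decomposition $E\left[ (Y - f(X))^{2} \right] = E\left[ (Y - E[Y | X])^{2} \right] + E\left[ (E[Y | X] - f(X))^{2} \right]$. The sketch below checks it in exact rational arithmetic on a small joint distribution. The pmf, the competitor `f`, and the helper names `expect`, `cond_mean`, `risk`, and `gap` are made up for illustration only.

```python
from fractions import Fraction as F

# Hypothetical joint pmf P(X = x, Y = y), chosen only for illustration.
pmf = {
    (0, 0): F(1, 8), (0, 2): F(3, 8),
    (1, 1): F(1, 4), (1, 5): F(1, 4),
}

def expect(g):
    """E[g(X, Y)] under the joint pmf."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())

def cond_mean(x0):
    """E[Y | X = x0]."""
    px = sum(p for (x, _), p in pmf.items() if x == x0)
    return sum(p * y for (x, y), p in pmf.items() if x == x0) / px

def risk(f):
    """E[(Y - f(X))^2]."""
    return expect(lambda x, y: (y - f(x)) ** 2)

def gap(f):
    """E[(E[Y|X] - f(X))^2]."""
    return expect(lambda x, y: (cond_mean(x) - f(x)) ** 2)

def f(x):
    """An arbitrary competitor, f(x) = 2x + 1."""
    return F(2) * x + 1

print(risk(cond_mean), risk(f), gap(f))  # prints: 19/8 5/2 1/8
```

Here $E[Y | X = 0] = 3/2$ and $E[Y | X = 1] = 3$, and the competitor's risk $5/2$ exceeds the minimal risk $19/8$ by exactly the gap $1/8$, as the decomposition predicts.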