Bias-Variance Trade-off
Definition
Let $\widehat{\theta}$ be an estimator of a parameter $\theta$, with bias $\text{Bias} \widehat{\theta} := E \widehat{\theta} - \theta$. Then the mean squared error $\text{MSE} \left( \widehat{\theta} \right) := E \left[ \left( \widehat{\theta} - \theta \right)^{2} \right]$ decomposes as $$ \text{MSE} \left( \widehat{\theta} \right) = \operatorname{Var} \widehat{\theta} + \left( \text{Bias} \widehat{\theta} \right)^{2} $$
Description
The mean squared error $\text{MSE}$ is frequently used as a measure for evaluating statistical models and as a loss function in machine learning, and the decomposition above expresses it precisely as a trade-off between bias and variance. For a statistician, handling bias can feel somewhat uncomfortable. Variance feels almost tangible, resting on the assumption of an appropriate probability distribution and the mathematical theory built on it, whereas the squared bias, essentially the discrepancy between the estimator's expectation and the true value, looks like a term that corrupts the analysis whenever the bias is left uncontrolled.
However, as of 2021, with the remarkable advances in machine learning, we are in an era where, in the sense of “prediction”, it is considered acceptable to use techniques that cut the variance enough to more than make up for the increase in bias, so that the overall error still shrinks. Although it varies by field, if machine learning is used more actively than classical statistical methods, there is no need to forsake a performance-enhancing method out of an obsession with controlling bias. (Of course, if one’s background is in statistics, this can be quite unsatisfactory and disagreeable. I feel the same way.)
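To make the trade-off concrete, here is a minimal Monte Carlo sketch in Python, assuming NumPy is available; the constants $\sigma^{2} = 4$, $n = 10$, the seed, and the repetition count are arbitrary choices for illustration, not values from the text. For a normal sample, the variance estimator with divisor $n + 1$ is more biased than the unbiased divisor-$(n-1)$ estimator, yet ends up with a smaller MSE.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma2, n, reps = 4.0, 10, 200_000

# reps independent normal samples of size n with true variance sigma2
x = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
ss = np.sum((x - x.mean(axis=1, keepdims=True)) ** 2, axis=1)

for divisor in (n - 1, n, n + 1):
    est = ss / divisor                      # candidate variance estimator
    bias2 = (est.mean() - sigma2) ** 2      # estimated squared bias
    var = est.var()                         # variance of the estimator
    mse = np.mean((est - sigma2) ** 2)      # Monte Carlo MSE
    print(f"divisor {divisor:2d}: Bias^2 = {bias2:.4f}, "
          f"Var = {var:.4f}, MSE = {mse:.4f}")
```

Under normality, the divisor $n + 1$ in fact minimizes the MSE within this family, even though $n - 1$ is the only unbiased choice: a small concession on bias buys a larger reduction in variance.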
Proof
Strategy: This is a trick commonly used in statistics: insert $0 = E \widehat{\theta} - E \widehat{\theta}$ inside the brackets of the expectation, expand, and then rearrange the resulting terms into the form of a variance and a bias.
Since $E \widehat{\theta} - \theta$ is a constant, $$ \begin{align*} \text{MSE} \left( \widehat{\theta} \right) =& E \left[ \left( \widehat{\theta} - \theta \right)^{2} \right] \\ =& E \left[ \left( \widehat{\theta} - E \widehat{\theta} + E \widehat{\theta} - \theta \right)^{2} \right] \\ =& E \left[ \left( \widehat{\theta} - E \widehat{\theta} \right)^{2} + 2 \left( \widehat{\theta} - E \widehat{\theta} \right) \left( E \widehat{\theta} - \theta \right) + \left( E \widehat{\theta} - \theta \right)^{2} \right] \\ =& E \left[ \left( \widehat{\theta} - E \widehat{\theta} \right)^{2} \right] + 2 E \left[ \left( \widehat{\theta} - E \widehat{\theta} \right) \left( E \widehat{\theta} - \theta \right) \right] + E \left[ \left( E \widehat{\theta} - \theta \right)^{2} \right] \\ =& \operatorname{Var} \widehat{\theta} + 2 E \left[ \left( \widehat{\theta} - E \widehat{\theta} \right) \left( E \widehat{\theta} - \theta \right) \right] + \left( E \widehat{\theta} - \theta \right)^{2} \\ =& \operatorname{Var} \widehat{\theta} + 2 E \left[ \left( \widehat{\theta} - E \widehat{\theta} \right) \right] E \left[ \left( E \widehat{\theta} - \theta \right) \right] + \left( \text{Bias} \widehat{\theta} \right)^{2} \\ =& \operatorname{Var} \widehat{\theta} + 2 \left( E \widehat{\theta} - E \widehat{\theta} \right) E \left[ \left( E \widehat{\theta} - \theta \right) \right] + \left( \text{Bias} \widehat{\theta} \right)^{2} \\ =& \operatorname{Var} \widehat{\theta} + 0 + \left( \text{Bias} \widehat{\theta} \right)^{2} \end{align*} $$
■
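The identity proved above is also easy to check numerically. Below is a quick sketch, again assuming Python with NumPy; the shrinkage estimator $\widehat{\theta} = 0.9 \bar{X}$ for the mean of a $N(\theta, 1)$ sample and the constants $\theta = 2$, $n = 25$ are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 2.0, 25, 500_000

# deliberately biased estimator: shrink the sample mean toward 0
est = 0.9 * rng.normal(loc=theta, scale=1.0, size=(reps, n)).mean(axis=1)

mse = np.mean((est - theta) ** 2)       # direct Monte Carlo MSE
var = est.var()                         # Var of the estimator
bias2 = (est.mean() - theta) ** 2       # squared bias
print(f"MSE = {mse:.5f}, Var + Bias^2 = {var + bias2:.5f}")
print(f"closed form: {0.9**2 / n + (0.1 * theta)**2:.5f}")
```

The first two printed numbers agree, and both match the closed form $\operatorname{Var} \widehat{\theta} + \left( \text{Bias} \widehat{\theta} \right)^{2} = 0.81 / n + (0.1 \theta)^{2}$.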
Caution
Notice that in the proof, the true value $\theta$ never enters the variance term. When evaluating the performance of the estimator $\widehat{\theta}$ through the mean squared error, its variance is not an indicator of how well it matches the true value; it only shows how tightly the estimator is concentrated around its own expectation $E \widehat{\theta}$.
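As an extreme illustration of this caution, consider a hypothetical “estimator” that ignores the data entirely and always returns $0$ (sketch assumes Python with NumPy; $\theta = 5$ and the sample sizes are arbitrary). Its variance is exactly zero, yet its MSE, which is pure squared bias, is far worse than that of the sample mean.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 5.0
x = rng.normal(loc=theta, scale=1.0, size=(100_000, 20))

estimators = {
    "constant 0 ": np.zeros(x.shape[0]),   # ignores the data: Var = 0
    "sample mean": x.mean(axis=1),         # unbiased: MSE = Var
}
for name, est in estimators.items():
    print(f"{name}: Var = {est.var():.4f}, "
          f"MSE = {np.mean((est - theta) ** 2):.4f}")
```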