Maximum Likelihood Estimator
Buildup
Consider a random variable $X$ with probability density function (pdf) $f \left( x ; \theta \right)$ for a parameter $\theta \in \Theta$. Let $X_{1} , \cdots , X_{n}$ be a random sample drawn independently and identically distributed (iid) from the same distribution as $X$, so that each $X_{k}$ has pdf $f ( x ; \theta )$, with realization $\mathbf{x} := \left( x_{1} , \cdots , x_{n} \right)$. The function $L$ defined as follows is called the Likelihood Function.

$$ L ( \theta ; \mathbf{x} ) := \prod_{k=1}^{n} f \left( x_{k} ; \theta \right) $$

As discussed below, we are interested in where this function attains its maximum, so it is convenient to take the logarithm and work with $l$, which converts the product $\prod$ into the sum $\sum$. Since the logarithm is strictly increasing, $L$ and $l$ attain their maxima at the same $\theta$.

$$ l ( \theta ; \mathbf{x} ) := \sum_{k=1}^{n} \log f \left( x_{k} ; \theta \right) $$
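As a concrete sketch of these two definitions (the normal pdf, the sample values, and the parameter value below are all arbitrary choices for illustration), one can verify numerically that the logarithm turns the product into a sum:

```python
import numpy as np

def normal_pdf(x, theta, sigma=1.0):
    """Density of N(theta, sigma^2) evaluated at x."""
    return np.exp(-(x - theta) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

x = np.array([0.8, 1.1, 1.3])  # hypothetical iid sample
theta = 1.0                    # hypothetical parameter value

L = np.prod(normal_pdf(x, theta))          # likelihood: a product over the sample
l = np.sum(np.log(normal_pdf(x, theta)))   # log-likelihood: a sum over the sample

assert np.isclose(np.log(L), l)  # log(prod) == sum(log), as claimed
```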
Definition 1
The estimator $\hat{\theta} := \hat{\theta} \left( \mathbf{X} \right)$ that satisfies the following is called the Maximum Likelihood Estimator (MLE).

$$ \hat{\theta} = \argmax_{\theta \in \Theta} L \left( \theta ; \mathbf{X} \right) $$
- $\mathbf{X}$ is a random vector $\mathbf{X} := \left( X_{1} , \cdots , X_{n} \right) $.
- $\argmax g$ denotes the argument at which the function $g$ attains its maximum, that is, the input value that makes $g$ the largest (see the sketch after this list).
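To make the $\argmax$ notation concrete, here is a minimal grid-search sketch (the toy function $g$ and the grid are arbitrary choices for illustration):

```python
import numpy as np

def g(t):
    return -(t - 2.0) ** 2           # toy function, maximized at t = 2

grid = np.linspace(-5, 5, 1001)      # candidate arguments
t_hat = grid[np.argmax(g(grid))]     # argmax g: the argument attaining the max

print(t_hat)  # ≈ 2.0: the value that makes g the largest
```

Note that `np.argmax` returns the *index* of the maximum, so indexing back into the grid recovers the argument itself.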
Explanation
Intuition
The term ‘likelihood’ is itself quite intuitive: it simply means “plausibility”.
For example, suppose the heights of three men picked off the street were measured to be 169cm, 171cm, and 182cm, and that the heights of Korean males follow a normal distribution $N \left( \mu , \sigma^{2} \right)$. Since the pdf of the normal distribution $f ( x ; \mu )$ attains its maximum at the mean $x = \mu$, the product $L \left( \theta ; \mathbf{x} \right)$ tends to be largest when $\theta$ places the peak of the density near the observed heights.
Note that the main argument of the function $L$ is not the data $\mathbf{x}$ but $\theta$. In other words, it is useful to imagine the data as fixed: rather than plugging values of $x$ into a fixed pdf $f(x)$, picture the density $f_{\theta}$ itself sliding left and right along the axis as $\theta$ varies.
Since we do not yet know much about the properties of $L$, we cannot confidently say that $L$ is largest at exactly $\theta = 171$, but it is certainly not largest at $\theta = 182$. The terms likelihood and $\argmax$ may seem unfamiliar, but in essence the Maximum Likelihood Estimator is simply “the most plausible value.”
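This intuition can be checked numerically. The sketch below evaluates the log-likelihood of the three measured heights over a grid of candidate means; since the story leaves $\sigma$ unspecified, the value $\sigma = 10$ used here is purely an assumption for illustration:

```python
import numpy as np

heights = np.array([169.0, 171.0, 182.0])  # the three measured heights
sigma = 10.0                               # assumed known, for illustration only

def log_likelihood(mu):
    """Log-likelihood of the sample under N(mu, sigma^2)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (heights - mu) ** 2 / (2 * sigma ** 2))

grid = np.linspace(150, 200, 2001)         # candidate means
l_values = np.array([log_likelihood(mu) for mu in grid])

mu_hat = grid[np.argmax(l_values)]
print(mu_hat)  # ≈ 174.0: the sample mean, neither 171 nor 182
```

The maximum lands at the sample mean $\bar{x} = 174$, consistent with the remark that the answer is certainly not $\theta = 182$.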
Formulas
If $L$ is differentiable, the Maximum Likelihood Estimator satisfies the following Estimating Equation, obtained by setting the partial derivative of the log-likelihood to zero.

$$ {{ \partial l ( \theta ) } \over { \partial \theta }} = 0 $$

This is merely an extension of the familiar calculus technique of finding a maximum by setting a derivative to zero. Although the notation may look like a partial differential equation, it is not one: the unknown is the number $\theta$, not a function. This notation can seem unfamiliar and intimidating in textbooks, especially for statistics students, who often do not deal with it after their freshman year in college. In reality, you won’t have to solve any differential equations, so don’t worry too much.
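As a worked example of the estimating equation (using the normal model from the intuition above, with $\sigma^{2}$ treated as known), the log-likelihood is

$$ l ( \mu ; \mathbf{x} ) = - \frac{n}{2} \log \left( 2 \pi \sigma^{2} \right) - \frac{1}{2 \sigma^{2}} \sum_{k=1}^{n} \left( x_{k} - \mu \right)^{2} $$

and setting its partial derivative to zero solves in closed form:

$$ \frac{\partial l}{\partial \mu} = \frac{1}{\sigma^{2}} \sum_{k=1}^{n} \left( x_{k} - \mu \right) = 0 \implies \hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} X_{k} = \overline{X} $$

For the three heights above, $\hat{\mu} = ( 169 + 171 + 182 ) / 3 = 174$, matching the numerical result.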
Hogg et al. (2013). Introduction to Mathematical Statistics (7th Edition): p209, 329.