
Maximum Likelihood Estimation for Linear Regression Model in Machine Learning

Summary

Assume the relationship between data $\mathbf{x}_{i} \in \mathbb{R}^{n}$ and its labels $y_{i} \in \mathbb{R}$ is described by the following linear model.

$$
y_{i} = \mathbf{w}^{\mathsf{T}} \mathbf{x}_{i} + \epsilon_{i}, \qquad i = 1, \ldots, K \tag{1}
$$

When $K > n$ and $\mathbf{X}$ has full column rank (so that $\mathbf{X}^{\mathsf{T}} \mathbf{X}$ is invertible), the parameter $\mathbf{w}_{\text{ML}}$ that maximizes the likelihood is as follows.

$$
\mathbf{w}_{\text{ML}} = (\mathbf{X}^{\mathsf{T}} \mathbf{X})^{-1} \mathbf{X}^{\mathsf{T}} \mathbf{y}
$$

Here, $\mathbf{y} = \begin{bmatrix} y_{1} & \cdots & y_{K} \end{bmatrix}^{\mathsf{T}}$ and $\mathbf{X} = \begin{bmatrix} \mathbf{x}_{1} & \cdots & \mathbf{x}_{K} \end{bmatrix}^{\mathsf{T}} \in \mathbb{R}^{K \times n}$.
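As a quick numerical sketch of this closed form (assuming NumPy; the sizes, noise level, and `w_true` below are hypothetical choices), generating data from the model and applying the formula recovers the parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: K = 100 samples, n = 3 features, so K > n.
K, n = 100, 3
w_true = np.array([2.0, -1.0, 0.5])  # hypothetical ground-truth parameters

X = rng.normal(size=(K, n))                      # rows of X are x_i^T
y = X @ w_true + rng.normal(scale=0.1, size=K)   # y_i = w^T x_i + eps_i

# w_ML = (X^T X)^{-1} X^T y, computed by solving the normal equations
# (numerically preferable to forming the inverse explicitly).
w_ml = np.linalg.solve(X.T @ X, X.T @ y)
print(w_ml)  # close to w_true
```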

Explanation

In $(1)$, $\mathbf{w} \in \mathbb{R}^{n}$ is the parameter vector and $\epsilon_{i} \sim N(0, \sigma^{2})$ is Gaussian noise. Since $\epsilon_{i}$ follows $N(0, \sigma^{2})$, $y_{i} = \mathbf{w}^{\mathsf{T}} \mathbf{x}_{i} + \epsilon_{i}$ follows $N(\mathbf{w}^{\mathsf{T}} \mathbf{x}_{i}, \sigma^{2})$.

$$
y_{i} \sim N(\mathbf{w}^{\mathsf{T}} \mathbf{x}_{i}, \sigma^{2})
$$
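A minimal sanity check of this distributional claim, under hypothetical `x`, `w`, and `sigma` (NumPy assumed): repeatedly sampling $y$ for a fixed $\mathbf{x}$ should give an empirical mean near $\mathbf{w}^{\mathsf{T}} \mathbf{x}$ and a standard deviation near $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fixed input, parameters, and noise level.
x = np.array([1.0, 2.0, -1.0])
w = np.array([0.5, -0.3, 1.2])
sigma = 0.4

# Repeated draws of y = w^T x + eps for the same x.
y_samples = w @ x + rng.normal(scale=sigma, size=100_000)

print(y_samples.mean(), w @ x)  # empirical mean vs. w^T x
print(y_samples.std(), sigma)   # empirical std vs. sigma
```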

Maximum likelihood estimation finds the $\mathbf{w}_{\text{ML}}$ that satisfies the following.

$$
\mathbf{w}_{\text{ML}} = \argmax_{\mathbf{w}} p(\mathbf{y} | \mathbf{w}, \mathbf{X})
$$

Since the noise terms $\epsilon_{i}$ are assumed independent, the likelihood of $\mathbf{y}$ factors into a product over the data points. The likelihood functions for $y_{i}$ and $\mathbf{y}$ in terms of $\mathbf{w}$ are as follows.

$$
p(y_{i} | \mathbf{w}, \mathbf{x}_{i}) = \dfrac{1}{\sqrt{2\pi \sigma^{2}}} \exp \left[ -\dfrac{(y_{i} - \mathbf{w}^{\mathsf{T}} \mathbf{x}_{i})^{2}}{2\sigma^{2}} \right]
$$

$$
\begin{align*}
p(\mathbf{y} | \mathbf{w}, \mathbf{X})
&= \prod_{i=1}^{K} p(y_{i} | \mathbf{w}, \mathbf{x}_{i}) \\
&= \prod_{i=1}^{K} \dfrac{1}{\sqrt{2\pi \sigma^{2}}} \exp \left[ -\dfrac{(y_{i} - \mathbf{w}^{\mathsf{T}} \mathbf{x}_{i})^{2}}{2\sigma^{2}} \right] \\
&= \dfrac{1}{(2\pi \sigma^{2})^{K/2}} \exp \left[ -\dfrac{1}{2\sigma^{2}} \sum_{i=1}^{K} (y_{i} - \mathbf{w}^{\mathsf{T}} \mathbf{x}_{i})^{2} \right] \\
&= \dfrac{1}{(2\pi \sigma^{2})^{K/2}} \exp \left[ -\dfrac{1}{2\sigma^{2}} \| \mathbf{y} - \mathbf{X}\mathbf{w} \|_{2}^{2} \right]
\end{align*}
$$
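The last step above, collapsing the sum of squares into a squared 2-norm, can be checked numerically. The sketch below (a hypothetical small problem; NumPy and SciPy assumed) compares the product of per-point densities with the vectorized expression:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Hypothetical small problem.
K, n, sigma = 10, 3, 0.5
w = rng.normal(size=n)
X = rng.normal(size=(K, n))
y = X @ w + rng.normal(scale=sigma, size=K)

# Product of the K per-point Gaussian densities.
lik_product = np.prod(norm.pdf(y, loc=X @ w, scale=sigma))

# Vectorized form with the squared 2-norm of the residual.
r = y - X @ w
lik_vector = (2 * np.pi * sigma**2) ** (-K / 2) * np.exp(-(r @ r) / (2 * sigma**2))

print(np.isclose(lik_product, lik_vector))  # True
```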

Since the likelihood is an exponential function, it is convenient to work with the log-likelihood. The logarithm is strictly increasing, so it does not change the maximizer.

$$
\begin{align*}
\mathbf{w}_{\text{ML}}
&= \argmax_{\mathbf{w}} \log p(\mathbf{y} | \mathbf{w}, \mathbf{X}) \\
&= \argmax_{\mathbf{w}} \left( -\dfrac{K}{2} \log (2\pi \sigma^{2}) - \dfrac{1}{2\sigma^{2}} \| \mathbf{y} - \mathbf{X}\mathbf{w} \|_{2}^{2} \right) \\
&= \argmax_{\mathbf{w}} \left( - \| \mathbf{y} - \mathbf{X}\mathbf{w} \|_{2}^{2} \right) \\
&= \argmin_{\mathbf{w}} \| \mathbf{y} - \mathbf{X}\mathbf{w} \|_{2}^{2}
\end{align*}
$$

The first term does not depend on $\mathbf{w}$ and the factor $1/2\sigma^{2}$ is positive, so maximizing the log-likelihood is equivalent to minimizing the squared error.
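One way to confirm this chain of equalities is to optimize both objectives numerically and check that they share a minimizer. This is only an illustrative sketch with synthetic data (NumPy and SciPy assumed; sizes and noise level are hypothetical):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Hypothetical synthetic data.
K, n, sigma = 60, 3, 0.3
X = rng.normal(size=(K, n))
y = X @ rng.normal(size=n) + rng.normal(scale=sigma, size=K)

def neg_log_lik(w):
    """Negative log-likelihood, including the constant term."""
    r = y - X @ w
    return K / 2 * np.log(2 * np.pi * sigma**2) + (r @ r) / (2 * sigma**2)

def squared_error(w):
    r = y - X @ w
    return r @ r

w0 = np.zeros(n)
w_mle = minimize(neg_log_lik, w0).x    # maximize the likelihood
w_lsq = minimize(squared_error, w0).x  # minimize the squared error

print(np.allclose(w_mle, w_lsq, atol=1e-4))  # same minimizer
```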

According to the least squares method, $\mathbf{w}_{\text{ML}}$ is as follows.

$$
\mathbf{w}_{\text{ML}} = (\mathbf{X}^{\mathsf{T}} \mathbf{X})^{-1} \mathbf{X}^{\mathsf{T}} \mathbf{y}
$$
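As a final check (synthetic data; NumPy assumed), the closed-form expression can be compared against a generic least squares solver, which should return the same estimate whenever $\mathbf{X}^{\mathsf{T}} \mathbf{X}$ is invertible:

```python
import numpy as np

rng = np.random.default_rng(4)

K, n = 50, 4
X = rng.normal(size=(K, n))
y = X @ rng.normal(size=n) + rng.normal(scale=0.2, size=K)

# Closed-form ML estimate, written exactly as in the formula above.
w_formula = np.linalg.inv(X.T @ X) @ X.T @ y

# The same least squares problem solved by NumPy's dedicated solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_formula, w_lstsq))  # True
```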

See Also