Bayesian Inference in Machine Learning
Overview
Bayesian inference is a statistical method that uses Bayes' theorem to estimate the distribution of parameters from prior knowledge and observed data.
Explanation
Assume that a random variable $\mathbf{x}$ follows a probability distribution with parameter $\theta$. The purpose of Bayesian inference is to estimate the distribution of $\theta$ by examining the samples drawn from $\mathbf{x}$. The key point is not the value of $\theta$, but estimating the “distribution” of $\theta$. Given $\mathbf{x}$, the probability density function related to $\theta$ is the conditional probability density function $p(\theta | \mathbf{x})$, which according to Bayes’ theorem is as follows.
$$ p(\theta | \mathbf{x}) = \dfrac{p(\mathbf{x} | \theta) p(\theta)}{p(\mathbf{x})} $$ $$ \text{posterior} = \dfrac{\text{likelihood} \times \text{prior}}{\text{evidence}} $$
Here, the left side $p(\theta | \mathbf{x})$ that we seek is called the posterior probability (distribution). It refers to the probability concerning $\theta$ after $\mathbf{x}$ has been drawn, i.e., after the event has occurred.
The term $p(\mathbf{x} | \theta)$ on the right side is known as the likelihood. Viewed as a function of $\theta$, it measures how probable the observed data $\mathbf{x}$ are under a given $\theta$.
The term $p(\theta)$ on the right side is called the prior probability (distribution). It represents the knowledge about $\theta$ before observing $\mathbf{x}$.
The denominator on the right, $p(\mathbf{x})$, is called the evidence.
Since $p(\mathbf{x})$ does not depend on $\theta$, it is a constant with respect to $\theta$ and serves only to normalize the posterior. Thus, we obtain the following.
$$ p(\theta | \mathbf{x}) \propto p(\mathbf{x} | \theta) p(\theta) $$ $$ \text{posterior} \propto \text{likelihood} \times \text{prior} $$
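The proportionality above can be checked numerically with a grid approximation. The sketch below uses a hypothetical coin-toss setup: a $\mathrm{Beta}(2, 2)$ prior on the head probability $\theta$ and 7 heads observed in 10 tosses (both assumptions for illustration), so the exact posterior is the conjugate $\mathrm{Beta}(9, 5)$.

```python
import numpy as np

# Grid approximation of the posterior for a coin's head probability theta.
# Hypothetical data: 7 heads in 10 tosses; prior is Beta(2, 2) (an assumption).
theta = np.linspace(0.001, 0.999, 999)           # grid over the parameter space
prior = theta ** (2 - 1) * (1 - theta) ** (2 - 1)  # unnormalized Beta(2, 2) prior
likelihood = theta ** 7 * (1 - theta) ** 3       # Bernoulli likelihood of 7 heads in 10
unnorm_posterior = likelihood * prior            # posterior ∝ likelihood × prior
posterior = unnorm_posterior / unnorm_posterior.sum()  # normalize over the grid

# The conjugate posterior is Beta(2+7, 2+3) = Beta(9, 5), whose mode is (9-1)/(9+5-2) = 2/3.
theta_mode = theta[np.argmax(posterior)]
```

Normalizing by the grid sum plays the role of dividing by the evidence $p(\mathbf{x})$: it rescales the product without changing where the maximum lies.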
Meanwhile, from the definition of the conditional probability density function, the following holds for an additional random variable $\mathbf{y}$.
$$ \begin{align*} p(\theta | \mathbf{x}, \mathbf{y}) &= \dfrac{p(\mathbf{x}, \mathbf{y} | \theta) p(\theta)}{p(\mathbf{x}, \mathbf{y})} \\ &= \dfrac{p(\mathbf{x}, \mathbf{y} | \theta) p(\theta)}{p(\mathbf{x}, \mathbf{y})} \dfrac{p(\mathbf{y})}{p(\mathbf{y})} \left( = \dfrac{{\color{royalblue}p(\mathbf{x}, \mathbf{y} | \theta)} p(\theta)}{\color{tomato}p(\mathbf{x}, \mathbf{y})} \dfrac{\color{tomato}p(\mathbf{y})}{\color{royalblue}p(\mathbf{y})} \right) \\ &= \dfrac{p(\mathbf{x} | \mathbf{y}, \theta) p(\theta)}{p(\mathbf{x} | \mathbf{y})} \left(= \dfrac{{\color{royalblue}p(\mathbf{x} | \mathbf{y}, \theta)} p(\theta)}{\color{tomato}p(\mathbf{x} | \mathbf{y})} \right) \\ \end{align*} $$
Note that the last equality uses $p(\mathbf{x}, \mathbf{y} | \theta) / p(\mathbf{y}) = p(\mathbf{x} | \mathbf{y}, \theta)$, which holds when $\mathbf{y}$ does not depend on $\theta$, i.e., $p(\mathbf{y} | \theta) = p(\mathbf{y})$.
Maximum a Posteriori Estimation
Finding the $\theta$ that maximizes $p(\theta | \mathbf{x})$ is called maximum a posteriori estimation, or simply MAP. Since $p(\mathbf{x})$ does not depend on $\theta$, the $\theta_{\text{MAP}}$ that maximizes the posterior probability is as follows.
$$ \begin{align*} \theta_{\text{MAP}} &= \argmax_{\theta} p(\theta | \mathbf{x}) \\ &= \argmax_{\theta} p(\mathbf{x} | \theta) p(\theta) \end{align*} $$
Furthermore, since the logarithmic function is a monotonically increasing function, it is equivalent to the form below.
$$ \begin{align*} \theta_{\text{MAP}} &= \argmax_{\theta} p(\theta | \mathbf{x}) \\ &= \argmax_{\theta} p(\mathbf{x} | \theta) p(\theta) \\ &= \argmax_{\theta} \log [p(\theta | \mathbf{x})] \\ &= \argmax_{\theta} \log [p(\mathbf{x} | \theta)p(\theta)] \end{align*} $$
- Maximum a Posteriori Estimation of a Linear Regression Model
  - Assuming the prior distribution is a uniform distribution, it is the same as the maximum likelihood estimation below.
  - Assuming the prior distribution is a normal distribution, it corresponds to ridge regression.
  - Assuming the prior distribution is a Laplace distribution, it corresponds to lasso regression.
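The normal-prior case can be sketched concretely: with Gaussian noise and a zero-mean Gaussian prior on the weights, maximizing the log posterior of a linear model is equivalent to minimizing $\|\mathbf{y} - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|^2$, which has the ridge closed form $\mathbf{w}_{\text{MAP}} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$. The data, noise level, and $\lambda$ below are illustrative assumptions.

```python
import numpy as np

# MAP estimation for linear regression with a Gaussian prior on the weights
# reduces to ridge regression: w_MAP = (X^T X + lam I)^{-1} X^T y.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # synthetic design matrix (assumption)
w_true = np.array([1.0, -2.0, 0.5])          # hypothetical true weights
y = X @ w_true + 0.1 * rng.normal(size=50)   # observations with Gaussian noise

lam = 1.0  # noise variance / prior variance ratio (hypothetical value)
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
```

A larger $\lambda$ (a tighter prior) shrinks the weights harder toward zero; $\lambda \to 0$ recovers the ordinary least-squares (maximum likelihood) solution.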
Maximum Likelihood Estimation
In contrast to MAP, the estimation method that considers only the likelihood, without considering the prior probability, is called maximum likelihood estimation (MLE). The $\theta_{\text{ML}}$ that maximizes the likelihood is as follows.
$$ \begin{align*} \theta_{\text{ML}} &= \argmax_{\theta} p(\mathbf{x} | \theta) \\ &= \argmax_{\theta} \log p(\mathbf{x} | \theta) \end{align*} $$
This is the same as assuming the prior probability to be a uniform distribution in the maximum a posteriori estimation.
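As a small numerical check, for a Bernoulli model with $k$ successes in $n$ trials the log-likelihood is $\log p(\mathbf{x} | \theta) = k \log \theta + (n - k) \log(1 - \theta)$, which is maximized at $\theta = k/n$. The counts below are illustrative assumptions.

```python
import numpy as np

# MLE for a Bernoulli parameter via the log-likelihood, using the fact that
# argmax of p(x|theta) equals argmax of log p(x|theta) (log is monotone).
n, k = 20, 14                                  # hypothetical: 14 successes in 20 trials
theta = np.linspace(0.001, 0.999, 999)         # grid over (0, 1), avoiding log(0)
log_lik = k * np.log(theta) + (n - k) * np.log(1 - theta)
theta_ml = theta[np.argmax(log_lik)]           # numerical argmax over the grid
# The closed form is theta_ml = k / n = 0.7.
```

Since no prior appears in the objective, this is exactly the MAP estimate under a uniform prior, as stated above.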