Bayesian Inference in Machine Learning

Overview

Bayesian inference is a statistical method for estimating the distribution of parameters using prior knowledge and observed data based on Bayes’ theorem.

Explanation

Assume that a random variable $\mathbf{x}$ follows a probability distribution with parameter $\theta$. The goal of Bayesian inference is to estimate the distribution of $\theta$ from samples drawn from $\mathbf{x}$. The key point is that we estimate not a single value of $\theta$, but its “distribution”. Given $\mathbf{x}$, the probability density function of $\theta$ is the conditional probability density function $p(\theta | \mathbf{x})$, which by Bayes’ theorem is as follows.

$$ p(\theta | \mathbf{x}) = \dfrac{p(\mathbf{x} | \theta) p(\theta)}{p(\mathbf{x})} $$ $$ \text{posterior} = \dfrac{\text{likelihood} \times \text{prior}}{\text{evidence}} $$

Here, the left side $p(\theta | \mathbf{x})$ that we seek is called the posterior probability (distribution). It refers to the probability concerning $\theta$ after $\mathbf{x}$ has been drawn, i.e., after the event has occurred.

The term $p(\mathbf{x} | \theta)$ on the right side is known as the likelihood.

The term $p(\theta)$ on the right side is called the prior probability (distribution). It represents the knowledge about $\theta$ before observing $\mathbf{x}$.

The denominator on the right, $p(\mathbf{x})$, is called the evidence.
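
For instance, with purely hypothetical numbers: suppose a coin’s bias is either $\theta = 0.5$ or $\theta = 0.8$, each with prior probability $p(\theta) = 0.5$, and a single head $\mathbf{x} = 1$ is observed. Then

$$ p(\theta = 0.8 | \mathbf{x} = 1) = \dfrac{0.8 \times 0.5}{0.8 \times 0.5 + 0.5 \times 0.5} = \dfrac{0.40}{0.65} \approx 0.62 $$

and likewise $p(\theta = 0.5 | \mathbf{x} = 1) = 0.25 / 0.65 \approx 0.38$. Note that the denominator $p(\mathbf{x} = 1) = 0.65$ is the same for both candidate values of $\theta$.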

Since the evidence $p(\mathbf{x})$ does not depend on $\theta$, it is constant with respect to $\theta$. Thus, we obtain the following.

$$ p(\theta | \mathbf{x}) \propto p(\mathbf{x} | \theta) p(\theta) $$ $$ \text{posterior} \propto \text{likelihood} \times \text{prior} $$
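
To make the proportionality concrete, here is a minimal Python sketch; the Bernoulli likelihood, the $\operatorname{Beta}(2, 2)$ prior, and the ten coin flips are all hypothetical choices for illustration. It evaluates $\text{likelihood} \times \text{prior}$ on a grid of $\theta$ values and normalizes over the grid, so the evidence $p(\mathbf{x})$ never has to be computed explicitly.

```python
import numpy as np

# Hypothetical data: 10 coin flips with 7 heads, modeled as Bernoulli(theta)
x = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])
heads, tails = x.sum(), len(x) - x.sum()

# Grid of candidate parameter values theta in (0, 1)
theta = np.linspace(0.001, 0.999, 999)

# Prior p(theta): Beta(2, 2), written in unnormalized form
prior = theta * (1 - theta)

# Likelihood p(x | theta): product of Bernoulli pmfs over the observations
likelihood = theta**heads * (1 - theta)**tails

# posterior ∝ likelihood × prior; normalizing over the grid stands in for p(x)
unnormalized = likelihood * prior
posterior = unnormalized / unnormalized.sum()

print("grid-approximate posterior mean:", (theta * posterior).sum())
```

For this conjugate pair the exact posterior is $\operatorname{Beta}(9, 5)$ with mean $9/14 \approx 0.64$, which the grid approximation reproduces.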

Meanwhile, when an additional random variable $\mathbf{y}$ is also given, the definition of the conditional probability density function yields the following.

$$ \begin{align*} p(\theta | \mathbf{x}, \mathbf{y}) &= \dfrac{p(\mathbf{x}, \mathbf{y} | \theta) p(\theta)}{p(\mathbf{x}, \mathbf{y})} \\ &= \dfrac{p(\mathbf{x}, \mathbf{y} | \theta) p(\theta)}{p(\mathbf{x}, \mathbf{y})} \dfrac{p(\mathbf{y})}{p(\mathbf{y})} \left( = \dfrac{{\color{royalblue}p(\mathbf{x}, \mathbf{y} | \theta)} p(\theta)}{\color{tomato}p(\mathbf{x}, \mathbf{y})} \dfrac{\color{tomato}p(\mathbf{y})}{\color{royalblue}p(\mathbf{y})} \right) \\ &= \dfrac{p(\mathbf{x} | \mathbf{y}, \theta) p(\theta)}{p(\mathbf{x} | \mathbf{y})} \left(= \dfrac{{\color{royalblue}p(\mathbf{x} | \mathbf{y}, \theta)} p(\theta)}{\color{tomato}p(\mathbf{x} | \mathbf{y})} \right) \\ \end{align*} $$

The last equality combines the like-colored factors: ${\color{tomato}p(\mathbf{y})} / {\color{tomato}p(\mathbf{x}, \mathbf{y})} = 1 / p(\mathbf{x} | \mathbf{y})$ holds by definition, while ${\color{royalblue}p(\mathbf{x}, \mathbf{y} | \theta)} / {\color{royalblue}p(\mathbf{y})} = p(\mathbf{x} | \mathbf{y}, \theta)$ holds when $\mathbf{y}$ does not depend on $\theta$.

Maximum a Posteriori Estimation

Finding the $\theta$ that maximizes $p(\theta | \mathbf{x})$ is called maximum a posteriori estimation, or simply MAP. Since $p(\mathbf{x})$ does not depend on $\theta$, the $\theta_{\text{MAP}}$ that maximizes the posterior probability is as follows.

$$ \begin{align*} \theta_{\text{MAP}} &= \argmax_{\theta} p(\theta | \mathbf{x}) \\ &= \argmax_{\theta} p(\mathbf{x} | \theta) p(\theta) \end{align*} $$

Furthermore, since the logarithm is a monotonically increasing function, this is equivalent to the form below.

$$ \begin{align*} \theta_{\text{MAP}} &= \argmax_{\theta} p(\theta | \mathbf{x}) \\ &= \argmax_{\theta} p(\mathbf{x} | \theta) p(\theta) \\ &= \argmax_{\theta} \log [p(\theta | \mathbf{x})] \\ &= \argmax_{\theta} \log [p(\mathbf{x} | \theta)p(\theta)] \end{align*} $$
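
Continuing the same hypothetical Beta–Bernoulli setup from above, the sketch below finds $\theta_{\text{MAP}}$ by maximizing $\log [p(\mathbf{x} | \theta) p(\theta)]$ over the grid; because the pair is conjugate, the result can be checked against the known mode of the $\operatorname{Beta}(\alpha + k, \beta + n - k)$ posterior.

```python
import numpy as np

# Same hypothetical setup: 7 heads out of 10 flips, Beta(2, 2) prior
heads, n = 7, 10
alpha, beta = 2.0, 2.0

theta = np.linspace(0.001, 0.999, 999)

# log posterior up to an additive constant: log p(x | theta) + log p(theta)
log_likelihood = heads * np.log(theta) + (n - heads) * np.log(1 - theta)
log_prior = (alpha - 1) * np.log(theta) + (beta - 1) * np.log(1 - theta)
log_posterior = log_likelihood + log_prior

theta_map = theta[np.argmax(log_posterior)]
print("grid MAP estimate:         ", theta_map)

# closed-form mode of the Beta(alpha + heads, beta + n - heads) posterior
print("closed-form posterior mode:", (alpha + heads - 1) / (alpha + beta + n - 2))
```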

Maximum Likelihood Estimation

In contrast to MAP, estimating $\theta$ from the likelihood alone, without a prior, is called maximum likelihood estimation, abbreviated ML or MLE. The $\theta_{\text{ML}}$ that maximizes the likelihood is as follows.

$$ \begin{align*} \theta_{\text{ML}} &= \argmax_{\theta} p(\mathbf{x} | \theta) \\ &= \argmax_{\theta} \log p(\mathbf{x} | \theta) \end{align*} $$

This is the same as performing maximum a posteriori estimation with a uniform distribution as the prior: a constant $p(\theta)$ does not affect which $\theta$ maximizes the product $p(\mathbf{x} | \theta) p(\theta)$.
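
With the same hypothetical coin-flip data, the sketch below computes $\theta_{\text{ML}}$ by maximizing the log-likelihood alone; it matches the closed-form answer $k/n$ and, as claimed, coincides with the MAP estimate under the uniform $\operatorname{Beta}(1, 1)$ prior, whose log-density is constant in $\theta$.

```python
import numpy as np

# Same hypothetical data: 7 heads out of 10 flips
heads, n = 7, 10
theta = np.linspace(0.001, 0.999, 999)

# Maximum likelihood: maximize log p(x | theta) only
log_likelihood = heads * np.log(theta) + (n - heads) * np.log(1 - theta)
theta_ml = theta[np.argmax(log_likelihood)]
print("grid ML estimate:", theta_ml, "| closed form k/n:", heads / n)

# MAP with the uniform prior Beta(1, 1): log p(theta) is constant,
# so the maximizer coincides with the ML estimate
log_prior_uniform = np.zeros_like(theta)
theta_map_uniform = theta[np.argmax(log_likelihood + log_prior_uniform)]
print("MAP with uniform prior:", theta_map_uniform)
```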