Bayesian Inference in Machine Learning

Overview

Bayesian inference is a statistical method that uses Bayes' theorem to estimate the distribution of parameters from prior knowledge and observed data.

Explanation

Assume that a random variable $\mathbf{x}$ follows a probability distribution with parameter $\theta$. The purpose of Bayesian inference is to estimate the distribution of $\theta$ by examining the samples drawn from $\mathbf{x}$. The key point is not the value of $\theta$, but estimating the "distribution" of $\theta$. Given $\mathbf{x}$, the probability density function related to $\theta$ is the conditional probability density function $p(\theta | \mathbf{x})$, which according to Bayes' theorem is as follows.

$$ p(\theta | \mathbf{x}) = \dfrac{p(\mathbf{x} | \theta) p(\theta)}{p(\mathbf{x})} $$

$$ \text{posterior} = \dfrac{\text{likelihood} \times \text{prior}}{\text{evidence}} $$

Here, the left-hand side $p(\theta | \mathbf{x})$ that we seek is called the posterior probability (distribution). It refers to the probability concerning $\theta$ after $\mathbf{x}$ has been drawn, i.e., after the event has occurred.

The term $p(\mathbf{x} | \theta)$ on the right-hand side is known as the likelihood.

The term $p(\theta)$ on the right-hand side is called the prior probability (distribution). It represents the knowledge about $\theta$ before observing $\mathbf{x}$.

The denominator on the right-hand side, $p(\mathbf{x})$, is called the evidence.

Since the distribution of the data does not change, $p(\mathbf{x})$ is a constant that does not depend on $\theta$. Thus, we obtain the following.

$$ p(\theta | \mathbf{x}) \propto p(\mathbf{x} | \theta) p(\theta) $$

$$ \text{posterior} \propto \text{likelihood} \times \text{prior} $$
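To make the proportionality concrete, here is a minimal numerical sketch assuming a Bernoulli likelihood for the samples and a Beta(2, 2) prior on $\theta$; the model, the "true" value 0.7, and the sample size 20 are choices made only for this illustration. The unnormalized posterior is evaluated on a grid of $\theta$ values and then normalized.

```python
import numpy as np

# Assumed toy model: x_i | theta ~ Bernoulli(theta), prior theta ~ Beta(2, 2).
rng = np.random.default_rng(42)
x = rng.binomial(1, 0.7, size=20)                 # 20 samples drawn with "true" theta = 0.7

theta_grid = np.linspace(0.001, 0.999, 999)       # grid over the parameter space (0, 1)

# log-likelihood: log p(x | theta) = sum_i [x_i log(theta) + (1 - x_i) log(1 - theta)]
log_likelihood = x.sum() * np.log(theta_grid) + (len(x) - x.sum()) * np.log(1 - theta_grid)

# log-prior (up to a constant): Beta(2, 2) density is proportional to theta * (1 - theta)
log_prior = np.log(theta_grid) + np.log(1 - theta_grid)

# posterior ∝ likelihood × prior; the evidence is recovered by normalizing on the grid
unnormalized = np.exp(log_likelihood + log_prior)
posterior = unnormalized / (unnormalized.sum() * (theta_grid[1] - theta_grid[0]))
```

Working with log-densities and normalizing only at the end avoids underflow and sidesteps computing the evidence $p(\mathbf{x})$ analytically.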

Meanwhile, when another variable $\mathbf{y}$ is given together with $\mathbf{x}$, the definition of the conditional probability density function yields the following.

$$
\begin{align*}
p(\theta | \mathbf{x}, \mathbf{y})
&= \dfrac{p(\mathbf{x}, \mathbf{y} | \theta) p(\theta)}{p(\mathbf{x}, \mathbf{y})} \\
&= \dfrac{p(\mathbf{x}, \mathbf{y} | \theta) p(\theta)}{p(\mathbf{x}, \mathbf{y})} \dfrac{p(\mathbf{y})}{p(\mathbf{y})}
\left( = \dfrac{{\color{royalblue}p(\mathbf{x}, \mathbf{y} | \theta)} p(\theta)}{{\color{tomato}p(\mathbf{x}, \mathbf{y})}} \dfrac{{\color{tomato}p(\mathbf{y})}}{{\color{royalblue}p(\mathbf{y})}} \right) \\
&= \dfrac{p(\mathbf{x} | \mathbf{y}, \theta) p(\theta)}{p(\mathbf{x} | \mathbf{y})}
\left( = \dfrac{{\color{royalblue}p(\mathbf{x} | \mathbf{y}, \theta)} p(\theta)}{{\color{tomato}p(\mathbf{x} | \mathbf{y})}} \right)
\end{align*}
$$

Here, the last equality collapses ${\color{tomato}p(\mathbf{x}, \mathbf{y})}/{\color{tomato}p(\mathbf{y})}$ into $p(\mathbf{x} | \mathbf{y})$ and ${\color{royalblue}p(\mathbf{x}, \mathbf{y} | \theta)}/{\color{royalblue}p(\mathbf{y})}$ into $p(\mathbf{x} | \mathbf{y}, \theta)$; the latter step uses the assumption that $\mathbf{y}$ does not depend on $\theta$, i.e., $p(\mathbf{y} | \theta) = p(\mathbf{y})$.
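The identity can be checked numerically on a small discrete toy model of my own construction (not from the derivation above), in which $\mathbf{y}$ is indeed independent of $\theta$:

```python
import numpy as np

# Toy model for illustration: theta ∈ {0, 1} with prior p(theta), y ∈ {0, 1} independent
# of theta, and x | y, theta ~ Bernoulli(q[theta, y]).
p_theta = np.array([0.3, 0.7])            # prior p(theta)
p_y = np.array([0.4, 0.6])                # marginal p(y), assumed independent of theta
q = np.array([[0.2, 0.5],                 # q[theta, y] = p(x = 1 | y, theta)
              [0.8, 0.1]])

# Joint table p(theta, y, x), indexed as joint[theta, y, x].
joint = np.zeros((2, 2, 2))
for t in range(2):
    for yv in range(2):
        joint[t, yv, 1] = p_theta[t] * p_y[yv] * q[t, yv]
        joint[t, yv, 0] = p_theta[t] * p_y[yv] * (1 - q[t, yv])

x_obs, y_obs = 1, 0

# Left-hand side: p(theta | x, y) computed directly from the joint table.
lhs = joint[:, y_obs, x_obs] / joint[:, y_obs, x_obs].sum()

# Right-hand side: p(x | y, theta) p(theta) / p(x | y).
p_x_given_y_theta = q[:, y_obs] if x_obs == 1 else 1 - q[:, y_obs]
p_x_given_y = (p_x_given_y_theta * p_theta).sum()        # marginalize out theta
rhs = p_x_given_y_theta * p_theta / p_x_given_y

print(lhs, rhs)                                           # the two arrays should match
```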

Maximum a Posteriori Estimation

Finding the $\theta$ that maximizes $p(\theta | \mathbf{x})$ is called maximum a posteriori estimation, or simply MAP. Since $p(\mathbf{x})$ does not depend on $\theta$, the $\theta_{\text{MAP}}$ that maximizes the posterior probability is as follows.

$$
\begin{align*}
\theta_{\text{MAP}} &= \argmax_{\theta} p(\theta | \mathbf{x}) \\
&= \argmax_{\theta} p(\mathbf{x} | \theta) p(\theta)
\end{align*}
$$

Furthermore, since the logarithm is a monotonically increasing function, this is equivalent to the form below.

$$
\begin{align*}
\theta_{\text{MAP}} &= \argmax_{\theta} p(\theta | \mathbf{x}) \\
&= \argmax_{\theta} p(\mathbf{x} | \theta) p(\theta) \\
&= \argmax_{\theta} \log [p(\theta | \mathbf{x})] \\
&= \argmax_{\theta} \log [p(\mathbf{x} | \theta) p(\theta)]
\end{align*}
$$
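Continuing the same assumed Bernoulli/Beta(2, 2) toy setup from above, $\theta_{\text{MAP}}$ can be read off as the argmax of the log posterior on a grid; for this particular conjugate choice it can also be checked against the known closed form $(\sum_i x_i + a - 1)/(n + a + b - 2)$.

```python
import numpy as np

# Same assumed toy model as above: x_i | theta ~ Bernoulli(theta), prior theta ~ Beta(2, 2).
rng = np.random.default_rng(42)
x = rng.binomial(1, 0.7, size=20)

theta_grid = np.linspace(0.001, 0.999, 999)
log_likelihood = x.sum() * np.log(theta_grid) + (len(x) - x.sum()) * np.log(1 - theta_grid)
log_prior = np.log(theta_grid) + np.log(1 - theta_grid)          # Beta(2, 2), up to a constant

# theta_MAP = argmax_theta log[p(x | theta) p(theta)]
theta_map = theta_grid[np.argmax(log_likelihood + log_prior)]

# Closed form for a Beta(a, b) prior with Bernoulli data: (sum(x) + a - 1) / (n + a + b - 2), a = b = 2.
theta_map_closed = (x.sum() + 1) / (len(x) + 2)
print(theta_map, theta_map_closed)                                # the two should agree closely
```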

Maximum Likelihood Estimation

In contrast to MAP, the estimation method that considers only the likelihood, without considering the prior probability, is called maximum likelihood estimation, or simply ML(E). The $\theta_{\text{ML}}$ that maximizes the likelihood is as follows.

$$
\begin{align*}
\theta_{\text{ML}} &= \argmax_{\theta} p(\mathbf{x} | \theta) \\
&= \argmax_{\theta} \log p(\mathbf{x} | \theta)
\end{align*}
$$

This is the same as performing maximum a posteriori estimation with the prior probability assumed to be a uniform distribution.
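As a sketch of this last point, under the toy Bernoulli model assumed above, $\theta_{\text{ML}}$ is simply the sample mean, and replacing the prior with a uniform one (a constant log-prior) shifts the MAP objective only by a constant, so the two argmaxes coincide.

```python
import numpy as np

# Same assumed Bernoulli toy data as above.
rng = np.random.default_rng(42)
x = rng.binomial(1, 0.7, size=20)

theta_grid = np.linspace(0.001, 0.999, 999)
log_likelihood = x.sum() * np.log(theta_grid) + (len(x) - x.sum()) * np.log(1 - theta_grid)

# theta_ML = argmax_theta log p(x | theta); for Bernoulli data this is the sample mean.
theta_ml = theta_grid[np.argmax(log_likelihood)]
print(theta_ml, x.mean())

# With a uniform prior the log-prior is constant, so MAP and ML pick out the same theta.
log_prior_uniform = np.zeros_like(theta_grid)
theta_map_uniform = theta_grid[np.argmax(log_likelihood + log_prior_uniform)]
print(theta_map_uniform == theta_ml)                              # True
```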