Bayesian Inference in Machine Learning

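Overview

Bayesian inference is a statistical method that, based on Bayes' theorem, estimates the distribution of a parameter from prior knowledge and observed data.

Explanation

Suppose a random variable $\mathbf{x}$ follows a probability distribution with parameter $\theta$. The goal of Bayesian inference is to estimate the distribution of $\theta$ from samples drawn from $\mathbf{x}$. The key point is that we estimate the "distribution" of $\theta$, not a single value of $\theta$. Given $\mathbf{x}$, the probability density function of $\theta$ is the conditional density $p(\theta | \mathbf{x})$. By Bayes' theorem it can be written as follows.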

$p(\theta | \mathbf{x}) = \dfrac{p(\mathbf{x} | \theta) p(\theta)}{p(\mathbf{x})}$ $\text{posterior} = \dfrac{\text{likelihood} \times \text{prior}}{\text{evidence}}$

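The left-hand side $p(\theta | \mathbf{x})$, which is the quantity we want, is called the posterior probability (distribution). It is the probability of $\theta$ after $\mathbf{x}$ has been drawn, that is, after the event has occurred.

The factor $p(\mathbf{x} | \theta)$ on the right-hand side is called the likelihood. It is read as a function of $\theta$ with the observed $\mathbf{x}$ held fixed.

The factor $p(\theta)$ on the right-hand side is called the prior probability (distribution). It represents what is known about $\theta$ before $\mathbf{x}$ is observed.

The denominator $p(\mathbf{x})$ on the right-hand side is called the evidence. It is the likelihood averaged over the prior, $p(\mathbf{x}) = \int p(\mathbf{x} | \theta) p(\theta) \, d\theta$.

Once the data are fixed, $p(\mathbf{x})$ does not depend on $\theta$, so it is a constant with respect to $\theta$. We therefore obtain the following.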

$p(\theta | \mathbf{x}) \propto p(\mathbf{x} | \theta) p(\theta)$ $\text{posterior} \propto \text{likelihood} \times \text{prior}$
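The proportionality above can be checked numerically on a small example. The sketch below is only an illustration under assumed inputs: the coin-flip data, the Beta(2, 2) prior, and the grid resolution are all hypothetical choices that do not come from the text. It multiplies likelihood and prior on a grid of $\theta$ values and then normalizes, which is what dividing by the evidence does.

```python
import numpy as np

# Hypothetical data: 10 coin flips, 1 = heads (7 heads, 3 tails).
x = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])

# Grid of candidate values for theta = P(heads).
theta = np.linspace(0.001, 0.999, 999)
dtheta = theta[1] - theta[0]

# Assumed prior p(theta): Beta(2, 2), written up to a constant factor.
prior = theta * (1.0 - theta)

# Likelihood p(x | theta) for independent Bernoulli draws.
heads = int(x.sum())
tails = len(x) - heads
likelihood = theta**heads * (1.0 - theta) ** tails

# Unnormalized posterior: likelihood times prior.
unnormalized = likelihood * prior

# The normalizing constant plays the role of the evidence p(x);
# here it is approximated by a Riemann sum over the grid.
normalizer = unnormalized.sum() * dtheta
posterior = unnormalized / normalizer

posterior_mean = (theta * posterior).sum() * dtheta
print(f"posterior mean of theta: {posterior_mean:.3f}")
```

With this prior and data the exact posterior is Beta(9, 5), whose mean is $9/14$, so the grid result can be compared against that value. The normalizer never changes the shape of the curve, which is why it can be dropped when only the location of the maximum matters.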

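Meanwhile, when there is an additional variable $\mathbf{y}$ that is independent of $\theta$, so that $p(\mathbf{y} | \theta) = p(\mathbf{y})$ and hence $p(\mathbf{x}, \mathbf{y} | \theta) = p(\mathbf{x} | \mathbf{y}, \theta) p(\mathbf{y})$, the definition of the conditional probability density function gives the following.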

$\begin{align*} p(\theta | \mathbf{x}, \mathbf{y}) &= \dfrac{p(\mathbf{x}, \mathbf{y} | \theta) p(\theta)}{p(\mathbf{x}, \mathbf{y})} \\ &= \dfrac{p(\mathbf{x}, \mathbf{y} | \theta) p(\theta)}{p(\mathbf{x}, \mathbf{y})} \dfrac{p(\mathbf{y})}{p(\mathbf{y})} \left( = \dfrac{{\color{royalblue}p(\mathbf{x}, \mathbf{y} | \theta)} p(\theta)}{\color{tomato}p(\mathbf{x}, \mathbf{y})} \dfrac{\color{tomato}p(\mathbf{y})}{\color{royalblue}p(\mathbf{y})} \right) \\ &= \dfrac{p(\mathbf{x} | \mathbf{y}, \theta) p(\theta)}{p(\mathbf{x} | \mathbf{y})} \left(= \dfrac{{\color{royalblue}p(\mathbf{x} | \mathbf{y}, \theta)} p(\theta)}{\color{tomato}p(\mathbf{x} | \mathbf{y})} \right) \\ \end{align*}$

Maximum a Posteriori Estimation

Finding the $\theta$ that maximizes $p(\theta | \mathbf{x})$ is called maximum a posteriori estimation, or MAP for short. Since $p(\mathbf{x})$ does not depend on $\theta$, the value $\theta_{\text{MAP}}$ that maximizes the posterior is as follows.

$\begin{align*} \theta_{\text{MAP}} &= \argmax_{\theta} p(\theta | \mathbf{x}) \\ &= \argmax_{\theta} p(\mathbf{x} | \theta) p(\theta) \end{align*}$

Moreover, since the logarithm is a monotonically increasing function, this is equivalent to the form below.

$\begin{align*} \theta_{\text{MAP}} &= \argmax_{\theta} p(\theta | \mathbf{x}) \\ &= \argmax_{\theta} p(\mathbf{x} | \theta) p(\theta) \\ &= \argmax_{\theta} \log [p(\theta | \mathbf{x})] \\ &= \argmax_{\theta} \log [p(\mathbf{x} | \theta)p(\theta)] \end{align*}$
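As a rough illustration of the chain of equalities above, the sketch below finds $\theta_{\text{MAP}}$ for a coin-flip model in two ways: by maximizing the product of likelihood and prior, and by maximizing its logarithm. The counts and the Beta(2, 2) prior are assumptions made for the example, and the grid search is a simple stand-in for a proper optimizer.

```python
import numpy as np

# Hypothetical data and an assumed Beta(a, b) prior.
heads, tails = 7, 3
a, b = 2.0, 2.0

# Candidate values for theta = P(heads).
theta = np.linspace(0.001, 0.999, 999)

# log p(x | theta) for independent Bernoulli draws.
log_likelihood = heads * np.log(theta) + tails * np.log(1.0 - theta)

# log p(theta) up to an additive constant, which does not move the argmax.
log_prior = (a - 1.0) * np.log(theta) + (b - 1.0) * np.log(1.0 - theta)

# argmax of log[p(x | theta) p(theta)].
theta_map_log = theta[np.argmax(log_likelihood + log_prior)]

# argmax of p(x | theta) p(theta) without the logarithm.
theta_map_prod = theta[np.argmax(np.exp(log_likelihood) * np.exp(log_prior))]

# Closed-form mode of the Beta posterior, for comparison.
theta_map_closed = (heads + a - 1.0) / (heads + tails + a + b - 2.0)

print(theta_map_log, theta_map_prod, theta_map_closed)
```

Both grid searches pick the same grid point, and it lies within one grid step of the closed-form mode $8/12$. In practice the log form is usually preferred because products of many small probabilities underflow, while sums of logs do not.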

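MAP estimation for the linear regression model
- If the prior is assumed to be uniform, it coincides with the maximum likelihood estimation below.
- If the prior is assumed to be a zero-mean normal distribution, it is equivalent to ridge regression.
- If the prior is assumed to be a zero-mean Laplace distribution, it is equivalent to lasso regression.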
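The ridge case in the list above can be verified numerically. The sketch below assumes a linear model with Gaussian noise of standard deviation `sigma` and an independent zero-mean Gaussian prior with standard deviation `tau` on each weight. These names, the synthetic data, and the penalty $\lambda = \sigma^2 / \tau^2$ are assumptions of the example. It computes the ridge solution in closed form and checks that the gradient of the negative log posterior vanishes there.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical).
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
sigma = 0.5  # assumed noise standard deviation
tau = 1.0    # assumed prior standard deviation of each weight
y = X @ w_true + sigma * rng.normal(size=n)

# Ridge penalty implied by the Gaussian likelihood and Gaussian prior.
lam = sigma**2 / tau**2

# Ridge regression: minimizes ||y - Xw||^2 + lam * ||w||^2.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def neg_log_posterior_grad(w):
    """Gradient of ||y - Xw||^2 / (2 sigma^2) + ||w||^2 / (2 tau^2)."""
    return -(X.T @ (y - X @ w)) / sigma**2 + w / tau**2


# At the MAP estimate this gradient should be (numerically) zero.
grad_at_ridge = neg_log_posterior_grad(w_ridge)
print(np.abs(grad_at_ridge).max())
```

Because the negative log posterior is a strictly convex quadratic in the weights, a vanishing gradient identifies its unique minimizer, so the ridge solution is the MAP estimate under these assumptions. The lasso case has no closed form in general and is not covered by this sketch.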

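Maximum Likelihood Estimation

In contrast to MAP, the estimation method that considers only the likelihood, without taking the prior into account, is called maximum likelihood estimation, or ML(E) for short. The value $\theta_{\text{ML}}$ that maximizes the likelihood is as follows.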

$\begin{align*} \theta_{\text{ML}} &= \argmax_{\theta} p(\mathbf{x} | \theta) \\ &= \argmax_{\theta} \log p(\mathbf{x} | \theta) \end{align*}$

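This is the same as maximum a posteriori estimation with the prior assumed to be uniform, because a constant $p(\theta)$ does not change the location of the maximum.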
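The equivalence with a uniform prior is easy to see in a small example. The sketch below reuses a coin-flip model with hypothetical counts. The parameter lives on the bounded interval $(0, 1)$, so a uniform prior is a proper density there. Adding the constant log prior leaves the argmax unchanged.

```python
import numpy as np

# Hypothetical data: 7 heads and 3 tails.
heads, tails = 7, 3

# Candidate values for theta = P(heads).
theta = np.linspace(0.001, 0.999, 999)

# log p(x | theta) for independent Bernoulli draws.
log_likelihood = heads * np.log(theta) + tails * np.log(1.0 - theta)

# Uniform prior on (0, 1): p(theta) = 1, so log p(theta) = 0 everywhere.
log_uniform_prior = np.zeros_like(theta)

# Maximum likelihood estimate on the grid.
theta_ml = theta[np.argmax(log_likelihood)]

# MAP estimate on the grid with the uniform prior.
theta_map_uniform = theta[np.argmax(log_likelihood + log_uniform_prior)]

print(theta_ml, theta_map_uniform)
```

Both estimates land on the same grid point, next to the sample proportion of heads $7/10$, which is the closed-form maximum likelihood estimate for this model.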

Maximum likelihood estimation for the linear regression model