
Importance Sampling

Overview¹

Importance sampling is one of the Monte Carlo methods: a sampling trick that can be used when approximating integrals (expectations) by finite sums.

Build-up

The probability that a value drawn from a standard normal distribution exceeds $z_{0} = 4$ is about $3.167 \times 10^{-5}$.

$$ \begin{equation} \int_{4}^{\infty} \dfrac{1}{\sqrt{2\pi}} e^{-z^{2}/2}dz \approx 3.167 \times 10^{-5} \end{equation} $$
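For reference, the true value can be computed directly from the complementary CDF in Distributions.jl, which the Monte Carlo estimates below can be checked against.

using Distributions

ccdf(Normal(), 4.0)   # P(Z > 4) ≈ 3.167e-5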

The Julia code to calculate this integral using Monte Carlo integration is as follows.

using Distributions

n = 50000                      # sample size
z₀ = 4.0

for i in 1:5
    z = rand(Normal(), n)      # n-samples drawn from a standard normal distribution 
    est_MC = sum(z .> z₀)/n    # estimate using Monte Carlo
    println(est_MC)
end

4.0e-5
6.0e-5
0.0
4.0e-5
4.0e-5

Across about five runs, the estimates are not close to the true value, and sometimes the estimate is exactly $0$. This inaccuracy arises because the tail of the standard normal distribution is so thin. Integration on paper accounts for the tails no matter how thin they are, but a computer only ever sees a finite number of samples, so an event with a very low probability of occurring may not appear among them at all. Of course, increasing $n$ improves accuracy, but that is too obvious to call a solution.
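To get a feel for how large $n$ has to be, here is a rough sketch: at $n = 10^{7}$, only about $317$ samples are expected beyond $z_{0} = 4$, which still leaves roughly $6\%$ relative error per run.

using Distributions

n = 10^7                     # about 317 samples expected beyond z₀ = 4
z = rand(Normal(), n)
println(sum(z .> 4.0)/n)     # ≈ 3.2e-5, roughly 6% relative error per run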

$$ \int f(x)p(x) dx = \int f(x)\dfrac{p(x)}{q(x)} q(x) dx = \int \left[ f(x)\dfrac{p(x)}{q(x)} \right] q(x) dx $$

Now, consider computing the expectation of $f(X)$ for a random variable $X$ with probability density function (PDF) $p$. This is the left-hand side of the expression above, and multiplying and dividing by any $q(x)$ leaves its value unchanged. The right-hand side has the same value as the left, but it reads as the 'expectation of $f(X)\dfrac{p(X)}{q(X)}$ for a random variable $X$ with PDF $q$.' If $q$ is chosen to be a PDF with a thicker tail than $p$, the inaccuracy caused by very-low-probability events can be reduced.

Definition

Let’s say that $X$ is a random variable with a PDF of $p$. Assume $q$ to be another PDF that satisfies the following.

$$ \supp p \subset \supp q \qquad (\iff q(x) \ne 0 \text{ whenever } p(x) \ne 0) $$

Importance sampling refers to the method of computing the expectation $E_{p}[f(X)]$ of $f(X)$ by calculating $E_{q}[f(X)\frac{p(X)}{q(X)}]$. Here, $\supp$ denotes the support of the function.

$$ E_{p}[f] = \int f(x)p(x) dx = \int \left[ f(x)\dfrac{p(x)}{q(x)} \right] q(x) dx = E_{q}[fp/q] $$

In this context, $q$ is called the proposal distribution, and $w = \frac{p}{q}$ is termed the importance weight.

Explanation

In other words, importance sampling calculates Monte Carlo integration as follows.

$$ \dfrac{1}{n}\sum_{i=1}^{n} f(x_{i})\dfrac{p(x_{i})}{q(x_{i})}, \quad x_{i} \sim q \quad \left( \text{instead of } \dfrac{1}{n}\sum_{i=1}^{n} f(x_{i}), \quad x_{i} \sim p \right) $$
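As a minimal sketch, this estimator can be wrapped in a generic Julia function (is_estimate and the test integrand x -> x^2 are illustrative names, not from the text):

using Distributions

# (1/n) Σ f(xᵢ) p(xᵢ)/q(xᵢ), with xᵢ drawn from the proposal q
function is_estimate(f, p, q, n)
    x = rand(q, n)
    w = pdf.(p, x) ./ pdf.(q, x)    # importance weights
    return sum(f.(x) .* w) / n
end

# e.g., E[X²] = 1 for a standard normal, with a Cauchy proposal
is_estimate(x -> x^2, Normal(), Cauchy(), 100_000)   # ≈ 1.0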

It is helpful in situations like the one in the build-up:

  • When the tail of $p(x)$ is too thin, choose a distribution with a thicker tail (and larger support) as the proposal distribution $q(x)$.
  • When it is difficult to sample from $p(x)$ directly, choose a distribution that is easy to sample from as the proposal distribution $q(x)$.

Since $q$ appears in the denominator beneath $p$, $\supp q$ must contain $\supp p$; otherwise the estimator silently ignores regions where $p$ has mass, as the sketch below shows.
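A minimal sketch of that failure mode: a $\operatorname{Uniform}(-3, 3)$ proposal can never produce samples beyond $3$, so the estimate of the tail probability $(1)$ stays at $0$ no matter how large $n$ is.

using Distributions

n = 50000
z = rand(Uniform(-3, 3), n)                        # supp q = [-3, 3] does not contain supp p = ℝ
w = pdf.(Normal(), z) ./ pdf.(Uniform(-3, 3), z)   # importance weights
println(sum(w[z .> 4.0])/n)                        # always 0.0, silently biased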

The result of recalculating the integral $(1)$ from the build-up by importance sampling, with the Cauchy distribution as the proposal distribution, is as follows.

[Figure: importance sampling estimates of the tail probability $(1)$]

# Julia code

using Distributions

n = 50000                      # sample size
z₀ = 4.0

for i in 1:5
    z = rand(Cauchy(), n)                       # n-samples drawn from a Cauchy distribution 
    w = pdf.(Normal(), z) ./ pdf.(Cauchy(), z)
    est_IS = sum(w[z .> z₀])/n                  # estimate using importance sampling
    println(est_IS)
end

3.0330744703037005e-5
3.1808448538011614e-5
3.049259883894785e-5
3.305423127141489e-5
3.1807695818207125e-5

Properties

Unbiased Estimator

$\dfrac{1}{n}\sum\limits f(x_{i})\dfrac{p(x_{i})}{q(x_{i})}$ is an unbiased estimator of $E_{p}[f(X)]$.

Proof

$$ \begin{align*} E_{q}\left[ \frac{1}{n}\sum_{i}^{n}f(X_{i})\dfrac{p(X_{i})}{q(X_{i})} \right] &= \frac{1}{n}\sum_{i}^{n} E_{q}\left[ f(X_{i})\dfrac{p(X_{i})}{q(X_{i})} \right] &\text{by linearity of $E$} \\ &= E_{q}\left[ f(X)\dfrac{p(X)}{q(X)} \right] \\ &= E_{p}\left[ f(X) \right] \end{align*} $$
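This can also be checked empirically (a sketch; est is a hypothetical helper): averaging many independent importance sampling estimates of the tail probability $(1)$ recovers the true value.

using Distributions, Statistics

function est(n)
    z = rand(Cauchy(), n)
    w = pdf.(Normal(), z) ./ pdf.(Cauchy(), z)
    return sum(w[z .> 4.0]) / n
end

mean(est(10_000) for _ in 1:1_000)   # ≈ 3.167e-5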

Variance

The variance of the importance sampling estimator is

$$ \begin{align} \Var \left[ \frac{fp}{q} \right] &= E_{q}\left[ \left( \frac{fp}{q} \right)^{2} \right] - \left( E_{q}\left[ \frac{fp}{q} \right] \right)^{2} \\ &= \int \frac{f(x)^{2}p(x)^{2}}{q(x)} dx - \left( \int f(x)p(x) dx \right)^{2} \nonumber \end{align} $$

so, for the variance to be finite, the following must hold.

$$ \int \frac{f(x)^{2}p(x)^{2}}{q(x)} dx \lt \infty $$

Roughly speaking, this requires $p(x)/q(x)$ to remain bounded, which in turn means that the tail of $q(x)$ should be at least as thick as the tail of $p(x)$. The $q$ that minimizes the variance is as follows.

$$ q^{\ast} = \argmin\limits_{q} \Var [fp/q] = \frac{ \left| f(x) \right| p(x)}{ \int \left| f(x) \right| p(x) dx} $$

Proof

The second term of $(2)$ seems to depend on $q$, but since $E_{q}\left[ \frac{fp}{q} \right] = {\displaystyle \int} f(x)p(x)dx$, it is actually independent of $q$.

$$ \argmin\limits_{q} \Var [fp/q] = \argmin\limits_{q} E_{q}\left[ \left( \frac{fp}{q} \right)^{2} \right] $$

Jensen’s Inequality

For a function $\phi$ that is convex and twice differentiable on an open interval $I$, if a random variable $X$ takes values in $I$ and has expectation $\mu$, then $$ \phi [ E(X) ] \le E [ \phi (X)] $$

Since the quadratic function is convex, Jensen's inequality gives

$$ \begin{align*} && \left( E_{q}\left[ \frac{fp}{q} \right] \right)^{2} &\le E_{q}\left[ \left( \frac{fp}{q} \right)^{2} \right] \\ \implies && \left( \int f(x)p(x) dx \right)^{2} &\le \int \frac{f(x)^{2} p(x)^{2}}{q(x)} dx \end{align*} $$

Equality holds when $q(x) = \frac{ \left| f(x) \right| p(x)}{ \int \left| f(x) \right| p(x) dx}$, which is therefore the variance-minimizing proposal $q^{\ast}$.
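As a quick check, substituting $q^{\ast}$ makes $\left| f \frac{p}{q^{\ast}} \right|$ constant, so the second moment collapses:

$$ \int \frac{f(x)^{2}p(x)^{2}}{q^{\ast}(x)} dx = \int \left| f(x) \right| p(x) \left( \int \left| f(t) \right| p(t) dt \right) dx = \left( \int \left| f(x) \right| p(x) dx \right)^{2} $$

In particular, when $f \ge 0$ this equals $\left( \int f(x)p(x) dx \right)^{2}$, so the inequality above holds with equality and the variance of the estimator is $0$.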


  1. Christopher M. Bishop, Pattern Recognition and Machine Learning (2006), pp. 532-534 ↩︎