

Importance Sampling

Overview¹

Importance sampling is a Monte Carlo method: a sampling trick that can be used when approximating integrals (expectations) with finite sums.

Build-up

The probability that a value drawn from the standard normal distribution exceeds $4$ is about $3.167 \times 10^{-5}$:

\begin{equation}
\int_{4}^{\infty} \dfrac{1}{\sqrt{2\pi}} e^{-z^{2}/2}\, dz \approx 3.167 \times 10^{-5}
\end{equation}
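For reference, the exact value can be computed directly from the standard normal CDF; a one-line check in Julia with Distributions.jl (`ccdf` is the complementary CDF $1 - F$):

using Distributions

ccdf(Normal(), 4.0)    # P(Z > 4) = 1 - Φ(4) ≈ 3.16712e-5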

The Julia code to calculate this integral using Monte Carlo integration is as follows.

using Distributions

n = 50000                      # sample size
z₀ = 4.0

for i in 1:5
    z = rand(Normal(), n)      # n-samples drawn from a standard normal distribution 
    est_MC = sum(z .> z₀)/n    # estimates using Monte Carlo
    println(est_MC)
end

4.0e-5
6.0e-5
0.0
4.0e-5
4.0e-5

If you run it about five times, the estimates don't come close to the true value, and sometimes the result even comes out as $0$. This inaccuracy arises because the tail of the standard normal distribution is too thin: with $n = 50{,}000$ samples, the expected number of samples exceeding $4$ is only about $n \cdot 3.167 \times 10^{-5} \approx 1.6$, so the estimate can only take small multiples of $1/n$ and often misses the tail entirely. Although integration on paper accounts for the tails no matter how thin they are, sampling on a computer involves only a finite number of samples, so an event with a very low occurrence probability may never enter the calculation. Of course, increasing $n$ gives more accuracy, but that is too obvious to call a solution.

$$ \int f(x)p(x)\, dx = \int f(x)\dfrac{p(x)}{q(x)} q(x)\, dx = \int \left[ f(x)\dfrac{p(x)}{q(x)} \right] q(x)\, dx $$

Now, consider a scenario where we calculate the expectation of $f(X)$ for a random variable $X$ with a probability density function (PDF) $p$. It can be calculated as on the left side of the above expression, and multiplying and dividing the integrand by any $q(x)$ leaves the value unchanged. While the right side has the same value as the left, it reads as the 'expectation of $f(X)\dfrac{p(X)}{q(X)}$ for a random variable $X$ with PDF $q$.' If $q$ is chosen to be a PDF with a thicker tail than $p$, the inaccuracy caused by very-low-probability events can be reduced.

Definition

Let's say that $X$ is a random variable with a PDF of $p$. Assume $q$ to be another PDF that satisfies the following.

$$ \supp p \subset \supp q \qquad (\iff q \ne 0 \text{ if } p \ne 0) $$

Importance sampling refers to the method of computing the expectation $E_{p}[f(X)]$ of $f(X)$ by calculating $E_{q}\left[ f(X)\frac{p(X)}{q(X)} \right]$. Here, $\supp$ denotes the support of a function.

$$ E_{p}[f] = \int f(x)p(x)\, dx = \int \left[ f(x)\dfrac{p(x)}{q(x)} \right] q(x)\, dx = E_{q}[fp/q] $$

In this context, $q$ is called the proposal distribution, and $w = \frac{p}{q}$ is termed the importance weight.

Explanation

In other words, importance sampling calculates Monte Carlo integration as follows.

$$ \dfrac{1}{n}\sum f(x_{i})\dfrac{p(x_{i})}{q(x_{i})}, \quad X_{i} \sim q(x) \qquad \left( \text{instead of } \dfrac{1}{n}\sum f(x_{i}), \quad X_{i} \sim p(x) \right) $$

It is helpful in situations like the example in the build-up:

  • When the tail of $p(x)$ is too thin, choose a distribution with a larger support as the proposal distribution.
  • When it is difficult to sample from $p(x)$, choose a distribution that is easy to sample from as the proposal distribution $q(x)$.

Since $q$ enters the denominator beneath $p$, $\supp q$ must include $\supp p$.
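Wrapped up as a single routine, the whole recipe looks like this; a minimal sketch in Julia, where the function name `is_estimate` and its interface are illustrative rather than taken from any library:

using Distributions

# Importance-sampling estimate of E_p[f(X)] using n samples from the proposal q.
# `is_estimate` is an illustrative helper, not a library function.
function is_estimate(f, p, q, n)
    x = rand(q, n)                    # sample from the proposal q
    w = pdf.(p, x) ./ pdf.(q, x)      # importance weights w = p/q
    return sum(f.(x) .* w) / n        # estimate of E_p[f(X)]
end

# Example: P(Z > 4) for Z ~ N(0, 1), with a heavy-tailed Cauchy proposal
is_estimate(z -> z > 4.0, Normal(), Cauchy(), 50_000)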

The result of recalculating the integral $(1)$ from the build-up using importance sampling, with the Cauchy distribution as the proposal distribution, is as follows.


# Julia code

using Distributions

n = 50000                      # sample size
z₀ = 4.0

for i in 1:5
    z = rand(Cauchy(), n)                       # n-samples drawn from a Cauchy distribution 
    w = pdf.(Normal(), z) ./ pdf.(Cauchy(), z)
    est_IS = sum(w[z .> z₀])/n                  # estimates using Importance Sampling
    println(est_IS)
end

3.0330744703037005e-5
3.1808448538011614e-5
3.049259883894785e-5
3.305423127141489e-5
3.1807695818207125e-5
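The improvement over the plain Monte Carlo runs can be quantified by repeating both estimators many times and comparing their spread. A rough empirical check (the repetition count of 200 is an arbitrary choice for this sketch):

using Distributions, Statistics

n, z₀, reps = 50_000, 4.0, 200

# 200 independent runs of each estimator of P(Z > 4)
mc_runs = [mean(rand(Normal(), n) .> z₀) for _ in 1:reps]
is_runs = map(1:reps) do _
    z = rand(Cauchy(), n)
    w = pdf.(Normal(), z) ./ pdf.(Cauchy(), z)
    sum(w[z .> z₀]) / n
end

mean(is_runs), std(mc_runs), std(is_runs)   # mean ≈ 3.167e-5; std(is_runs) ≪ std(mc_runs)

Both estimators center on the true value, but the importance-sampling runs should show a far smaller standard deviation, which is exactly what the properties below make precise.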

Properties

Unbiased Estimator

$\dfrac{1}{n}\sum\limits f(X_{i})\dfrac{p(X_{i})}{q(X_{i})}$ is an unbiased estimator of $E_{p}[f(X)]$.

Proof

\begin{align*}
E_{q}\left[ \frac{1}{n}\sum_{i}^{n} f(X_{i})\dfrac{p(X_{i})}{q(X_{i})} \right] &= \frac{1}{n}\sum_{i}^{n} E_{q}\left[ f(X_{i})\dfrac{p(X_{i})}{q(X_{i})} \right] && \text{by linearity of } E \\
&= E_{q}\left[ f(X)\dfrac{p(X)}{q(X)} \right] && \text{since the } X_{i} \text{ are iid with density } q \\
&= E_{p}\left[ f(X) \right]
\end{align*}

Variance

The variance under importance sampling is

\begin{align}
\Var \left[ \frac{fp}{q} \right] &= E_{q}\left[ \left( \frac{fp}{q} \right)^{2} \right] - \left( E_{q}\left[ \frac{fp}{q} \right] \right)^{2} \\
&= \int \frac{f(x)^{2}p(x)^{2}}{q(x)}\, dx - \left( \int f(x)p(x)\, dx \right)^{2} \nonumber
\end{align}

so, in order for the variance to be finite, the following must hold.

$$ \int \frac{f(x)^{2}p(x)^{2}}{q(x)}\, dx < \infty $$

This implies that $p(x)/q(x)$ must be bounded, which in turn means that the tail of $q(x)$ should be thicker than the tail of $p(x)$. The $q$ that minimizes the variance is as follows.

$$ q^{\ast} = \argmin\limits_{q} \Var [fp/q] = \frac{\left| f(x) \right| p(x)}{\int \left| f(x) \right| p(x)\, dx} $$

Proof

The second term of $(2)$ seems to depend on $q$, but since $E_{q}\left[ \frac{fp}{q} \right] = \int f(x)p(x)\, dx$, it is actually independent of $q$. Therefore,

$$ \argmin\limits_{q} \Var [fp/q] = \argmin\limits_{q} E_{q}\left[ \left( \frac{fp}{q} \right)^{2} \right] $$

Jensen’s Inequality

For a function $\phi$ that is convex and twice differentiable on an open interval $I$, if a random variable $X$ has expectation $\mu$ and $X \subset I$, then
$$ \phi[E(X)] \le E[\phi(X)] $$

Since the quadratic function $x \mapsto x^{2}$ is convex, Jensen's inequality gives

\begin{align*}
&& \left( E_{q}\left[ \frac{fp}{q} \right] \right)^{2} &\le E_{q}\left[ \left( \frac{fp}{q} \right)^{2} \right] \\
\implies && \left( \int f(x)p(x)\, dx \right)^{2} &\le \int \frac{f(x)^{2} p(x)^{2}}{q(x)}\, dx
\end{align*}

Equality holds if $q(x) = \frac{\left| f(x) \right| p(x)}{\int \left| f(x) \right| p(x)\, dx}$.
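Indeed, substituting this $q$ into the second moment recovers the lower bound exactly:

$$ \int \frac{f(x)^{2}p(x)^{2}}{q(x)}\, dx = \left( \int \left| f \right| p\, dx \right) \int \frac{f(x)^{2}p(x)^{2}}{\left| f(x) \right| p(x)}\, dx = \left( \int \left| f(x) \right| p(x)\, dx \right)^{2} $$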


  1. Christopher M. Bishop, Pattern Recognition and Machine Learning (2006), pp. 532-534.