Scenario: we know a distribution in terms of its pdf $f(x)$ and we wish to sample from it.
Sampling from a pdf can be difficult even when we know the pdf formula. Given the closed form expression of the normal distribution pdf for example, there is no obvious way to generate samples from the distribution.
Approach 1: inverse transform sampling
From $f$, get the cdf $F$ and invert it to get $F^{-1}$. Then sample $u \sim U(0,1)$ and pipe it into $F^{-1}$: the values $F^{-1}(u)$ are distributed according to $f$.
Intuition: picture the cdf. Pick values uniformly in $[0,1]$ on the y-axis, walk along a horizontal line until you hit the cdf, and take the x-coordinate.
This works well if you can compute $F^{-1}$. But this is often difficult, e.g. for the normal distribution.
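For a distribution where $F^{-1}$ *is* easy, e.g. the exponential, here is a minimal sketch (assuming numpy; the rate $\lambda=2$ is just an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_exponential(lam, n):
    """Inverse transform sampling for Exp(lam), whose cdf is F(x) = 1 - exp(-lam*x)."""
    u = rng.uniform(0, 1, size=n)       # uniform samples on the cdf's y-axis
    return -np.log(1 - u) / lam         # F^{-1}(u): the matching x-coordinates

s = sample_exponential(2.0, 10_000)
print(s.mean())                         # ≈ 1/lam = 0.5
```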
Approach 2: rejection sampling
This is a Monte Carlo (MC) algorithm. Let $M$ be the maximum of $f$. Sample $x$ uniformly in some finite interval of the x-axis that covers the support of $f$ (or at least most of it). For each sample, sample $y \sim U(0,M)$. If $y < f(x)$, keep the $x$ sample; otherwise reject it. Pretty intuitive.
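A minimal sketch of this in Python (assuming numpy; the target pdf $f(x)=2x$ on $[0,1]$ is just an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def rejection_sample_uniform(f, a, b, M, n):
    """Draw n samples from pdf f supported on [a, b], where M >= max of f."""
    samples = []
    while len(samples) < n:
        x = rng.uniform(a, b)       # uniform proposal on the x-axis
        y = rng.uniform(0, M)       # uniform height
        if y < f(x):                # accept x with probability f(x)/M
            samples.append(x)
    return np.array(samples)

samples = rejection_sample_uniform(lambda x: 2 * x, 0.0, 1.0, 2.0, 10_000)
print(samples.mean())               # ≈ 2/3, the mean of the pdf f(x) = 2x on [0,1]
```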
The obvious downside arises when the desired distribution is very far from uniform, e.g. when all its mass lies in a small part of the sampling interval. Then almost all our uniform samples will get rejected.
We can repair this by noting that instead of sampling $x$ uniformly, we can sample according to any *proposal distribution*, and we should choose one whose pdf $g$ looks similar to $f$. Choose a scaling factor $Q$ such that the subgraph of $Qg$ fully contains the subgraph of $f$ ($Q$ fulfils a similar role to $M$). Now repeatedly sample $x$ values from $g$; for each one, sample $y \sim U(0,Qg(x))$ and keep the sample if $y < f(x)$.
Think hard about why this works for any proposal distribution -- it's not obvious. For intuition, in the uniform case it is as if we are sampling $(x,y)$ pairs uniformly from a 2D rectangle containing the desired pdf, and keeping the samples that fall in the subgraph. In the non-uniform case, instead of a 2D rectangle, we are sampling uniformly from a wonky shape (the subgraph of $Qg$) which contains the desired pdf. And sampling uniformly from the wonky shape is equivalent to sampling $x$ coords in proportion to the shape's height at each point (since short bits of the wonky shape are less likely to be hit by uniform sampling than tall bits), and then sampling uniformly along the vertical line from the bottom to the top of the wonky shape at that $x$.
An easier way to think about this is: think about throwing darts at a rectangular board containing the pdf, and accepting samples that fall below the pdf. This certainly works, and is equivalent to sampling uniformly on the x and y axes. But the board does not have to be rectangular; it can be any shape as long as it contains the whole pdf, and if we sample from the shape uniformly and discard the points outside the pdf's subgraph, that's equivalent to sampling the pdf's subgraph uniformly. So say our wonky shape is itself the subgraph of a pdf (or of a positive multiple of a pdf); then sampling uniformly from the wonky shape is equivalent to sampling an x coord according to that pdf, and then sampling uniformly along the vertical up to the top of the wonky shape at that x.
See https://en.wikipedia.org/wiki/Rejection_sampling#Description
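As a concrete sketch of the proposal-based version (assuming numpy): sampling $N(0,1)$ with a Laplace(0,1) proposal, a standard textbook pairing; $Q=\sqrt{2e/\pi}$ is the smallest scaling for which $Qg \ge f$ everywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

def rejection_sample_normal(n):
    """Draw n samples from N(0,1) using a Laplace(0,1) proposal g."""
    Q = np.sqrt(2 * np.e / np.pi)    # smallest Q with Q*g(x) >= f(x) for all x
    samples = []
    while len(samples) < n:
        x = rng.laplace(0.0, 1.0)                        # sample from the proposal g
        g = 0.5 * np.exp(-abs(x))                        # Laplace(0,1) pdf at x
        f = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)     # N(0,1) pdf at x
        y = rng.uniform(0, Q * g)
        if y < f:
            samples.append(x)
    return np.array(samples)

s = rejection_sample_normal(10_000)
print(s.mean(), s.var())             # ≈ 0 and ≈ 1
```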
Importance sampling
Importance sampling is not actually a sampling method. It is a method for estimating an expectation (or some other parameter) of a distribution when it is difficult to sample from that distribution, or when we seek a lower-variance estimator than the naive mean estimator.
Suppose we have a rv $X$ distributed according to pdf $p$ and we wish to estimate $E[X]$. If we can sample from $p$, we can estimate $E[X] \approx \frac1n \sum_{i=1}^n x_i$ where the $x_i$ are iid samples, and this is usually a pretty good estimate. But if we *cannot* sample from $p$, it is a bit trickier. One thing we can do is estimate the integral using MC integration. Sample iid $u_1, \dots, u_n \sim U(a,b)$ where $[a,b]$ covers all (or at least most) of the support of $X$, then use
$$E[X] = \int x p(x) dx \approx \frac{b-a}{n} \sum_{i=1}^n u_i p(u_i)$$
Here we are estimating the area under the curve $xp(x)$ by sampling $x$ coords uniformly on $[a,b]$, taking the average of the curve's heights at those $x$s to get an average height, and then multiplying that average height by $b-a$ to get the area under the curve.
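A quick sketch of this estimator (assuming numpy; the target $p$ is a Beta(2,5) pdf, chosen only because it has a simple closed form and a known mean of $2/7$, with $[a,b]=[0,1]$):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, n = 0.0, 1.0, 100_000

p = lambda x: 30 * x * (1 - x) ** 4      # Beta(2,5) pdf, supported on [0, 1]

u = rng.uniform(a, b, size=n)            # uniform samples on [a, b]
estimate = (b - a) / n * np.sum(u * p(u))
print(estimate)                          # ≈ 2/7 ≈ 0.286, the true mean
```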
This is an unbiased estimator, but it doesn't have particularly low variance. Consider a distribution with almost all its mass in $[-0.01, 0.01]$. If we sample uniformly from $[-1,1]$, then most of our samples are taken in regions which have little impact on the estimate, i.e. which aren't *important* to the estimation. This estimator would have very high variance. We'd be better off mostly sampling from a tighter interval around 0, i.e. encouraging the sampling of more important values. This is the key idea behind importance sampling.
In the estimate above, each $u_i$ is distributed according to a uniform pdf $q(x) = \frac{1}{b-a}$ supported on $[a,b]$. Rewrite it as
$$E_p[X] = \int xp(x)\,dx = \int x\frac{p(x)}{q(x)}q(x)\,dx \\ = (b-a)\int xp(x)q(x)\,dx \\ = (b-a)\,E_q[Xp(X)] \approx \frac{b-a}{n} \sum_{i=1}^n u_i p(u_i)$$
This is mathematically how the MC integration works (the second line uses the fact that $q$ is the constant $\frac{1}{b-a}$ on $[a,b]$, so $\frac{p(x)}{q(x)} = (b-a)p(x)$). We are evaluating the expectation of $X$ w.r.t. $p$ by evaluating the expectation of $X\frac{p(X)}{q(X)}$ w.r.t. $q$, and then using the usual mean estimator for that expectation. But we could replace the uniform distribution with any other distribution! If the pdf of that distribution is $q(x)$ and its samples are $y_i$, our estimator is
$$E_p[X] \approx \frac1n \sum_{i=1}^n y_i \frac{p(y_i)}{q(y_i)}$$
So we can choose the distribution of $q$ to be close to the distribution of $p$. As long as $q$ is non-zero on the support of $p$, this works and is unbiased.
Intuitively, we are sampling from $p$ by instead sampling from $q$ and then *reweighting* those samples according to what we know about the two distributions. The ratio $p(x)/q(x)$ is sometimes called the *sampling ratio*, more commonly the *importance weight* or likelihood ratio, and it measures how important each sample drawn from $q$ is for estimating the expectation under $p$.
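A sketch of the resulting estimator (assuming numpy; reusing the Beta(2,5) target from the MC-integration sketch above, with a Beta(2,2) proposal, an arbitrary choice that is nonzero on all of $(0,1)$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

p = lambda x: 30 * x * (1 - x) ** 4      # target pdf: Beta(2,5), true mean 2/7
q = lambda x: 6 * x * (1 - x)            # proposal pdf: Beta(2,2)

y = rng.beta(2, 2, size=n)               # samples drawn from the proposal
estimate = np.mean(y * p(y) / q(y))      # reweight by the importance ratio p/q
print(estimate)                          # ≈ 2/7 ≈ 0.286
```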
Even in cases where we can directly sample from the target distribution, importance sampling can give us a lower-variance estimate. For example, to estimate the mean of $N(0,1)$ we can take $\hat \theta = \frac 1 n \sum_{i=1}^n x_i$ where the $x_i \sim N(0,1)$ are iid, giving $\text{var}(\hat \theta) = \frac{1}{n}$. If we instead sample from a proposal that concentrates where the integrand $|x|p(x)$ is large (the variance-minimizing choice is $q(x) \propto |x|p(x)$), the importance-sampling estimator has lower variance. The proposal must be chosen with care, though: naively narrowing it to $N(0,\frac{1}{2})$ actually makes the variance infinite, because the weights $p(x)/q(x)$ blow up in the tails. And to use importance sampling in this way, we need an a priori idea of where the integrand puts its mass, to know where to focus our sampling.
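A quick empirical check of the variance reduction (a sketch assuming numpy; the proposal is $q(x) \propto |x|p(x)$, sampled as a random sign times a Rayleigh(1) draw, so that $q(y) = \frac12 |y| e^{-y^2/2}$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1_000, 2_000

def direct_estimate():
    return rng.standard_normal(n).mean()

def importance_estimate():
    # Proposal q(x) proportional to |x| p(x): draw |x| ~ Rayleigh(1), then a random sign.
    y = rng.choice([-1.0, 1.0], size=n) * rng.rayleigh(scale=1.0, size=n)
    p = np.exp(-y ** 2 / 2) / np.sqrt(2 * np.pi)     # N(0,1) pdf
    q = 0.5 * np.abs(y) * np.exp(-y ** 2 / 2)        # density of sign * Rayleigh(1)
    return np.mean(y * p / q)

direct = [direct_estimate() for _ in range(trials)]
imp = [importance_estimate() for _ in range(trials)]
print(np.var(direct))    # ≈ 1/n = 0.001
print(np.var(imp))       # ≈ (2/pi)/n ≈ 0.00064
```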
Effective sample size
We sample 30 elements independently from a population distributed with variance $\sigma^2$. Then the sample size $n$ is 30, and the estimator of the mean $\mu$ has variance $\sigma^2/n$. So the more elements we sample, the more likely it is that the estimator is close to the true mean.
Now suppose 20 of those elements were positively correlated with each other. Then the estimator of the mean will have variance bigger than $\sigma^2/n$, meaning we are less confident that the estimator gives us an estimate close to the true mean. Another way to think about this is that although our actual sample size was 30, the effective sample size $n_e$ will be somewhat smaller. The ESS $n_e$ is defined as the quantity such that $\sigma^2/n_e$ equals the variance of the estimator.
For instance if all elements of the sample are 100% correlated, then we only really have one datapoint from which to compute the mean, so $n_e=1$.
Rephrasing: given a sample, if all its elements are independent, then we can use the sample mean as a good estimate of the mean. But if the elements are not independent, then we lose confidence in the precision of that mean. The ESS tells us: "when it comes to calculating the variance of the mean estimator, your sample size might be $n$ but *effectively* your sample size is $n_e$".
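One concrete way to see this numerically (a sketch assuming numpy and equicorrelated Gaussian samples with pairwise correlation $\rho$, which is just one possible correlation structure):

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, sigma2, trials = 30, 0.5, 1.0, 20_000

# Equicorrelated samples: x_i = sqrt(rho)*z + sqrt(1-rho)*e_i, each with variance sigma2.
z = rng.standard_normal((trials, 1))         # shared component, one per trial
e = rng.standard_normal((trials, n))         # independent components
x = np.sqrt(rho) * z + np.sqrt(1 - rho) * e

var_of_mean = x.mean(axis=1).var()
n_e = sigma2 / var_of_mean                   # definition: var(mean estimator) = sigma2 / n_e
print(n_e)                                   # ≈ n / (1 + (n-1)*rho) ≈ 1.94, far below 30
```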
Kish's ESS
Suppose our sample is independent but weighted. Take a sample {a,b} with weights {1,2}. Then it is as if we actually had three samples {a,b,b} where the last two are 100% correlated. If we take $(a+2b)/3$ as our estimator of the weighted mean, its variance is $\frac 5 9 \sigma^2$, which is quite a bit larger than the $\frac 1 3 \sigma ^2$ we would get from three independent 1-weighted elements. So the effective sample size is not 3; in fact it's less than 2 (since we only have two distinct elements and we've unbalanced them), it is $9/5 = 1.8$.
This value is given by Kish's formula for the ESS: $n_e = \frac{\left(\sum_i w_i\right)^2}{\sum_i w_i^2}$.
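A one-liner sketch of the formula (assuming numpy), checked against the {a,b} example above:

```python
import numpy as np

def kish_ess(weights):
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

print(kish_ess([1, 2]))       # 9/5 = 1.8, matching the weighted {a, b} example
print(kish_ess([1, 1, 1]))    # 3.0: equal weights give ESS equal to the sample size
```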