Wednesday, July 28, 2021

Intuition for Lagrange multipliers with multiple constraints

This is an attempt to explain the method of Lagrange multipliers without resorting to arguments about tangency of level curves, which I don't find particularly helpful and which also don't work in the case of multiple constraints.

Task: maximise $f : \mathbb R^n \to \mathbb R$ subject to a number of constraints $g_i = 0$, where each $g_i : \mathbb R^n \to \mathbb R$.

For intuition, take $n=2, i=1$ so that we are maximising $f(x,y)$ subject to $g(x,y) = 0$. The first observation is that $g(x,y)=0$ describes a curve $S$ (specifically a level curve) on the XY-plane. We are walking along this curve and looking for the maximum value of $f$. By the fact of being a level curve of $g(x,y)$, the gradient $\nabla g$ is orthogonal to $S$ at all points on $S$. This is obvious: if the gradient wasn't orthogonal, then we could nudge ourselves along $S$ in the direction of the gradient and get to a higher value of $g$ than where we were before; but the value of $g$ does not change along $S$ by construction.

Now, consider the gradient $\nabla f$ along $S$. By the reasoning we just used, if $\nabla f$ is not orthogonal to $S$, then we can nudge ourselves along $S$ in the direction of the gradient (by "in the direction of the gradient", I mean we could project $\nabla f$ onto the tangent line at our current position on $S$, giving us a vector on that tangent line, then take a step along $S$ in that direction) and reach a higher value of $f$. At the point on $S$ where $f$ is maximal, there is no direction along $S$ where we can move in order to increase $f$, and so $\nabla f$ must be orthogonal to $S$.

At this point, both $\nabla f$ and $\nabla g$ are orthogonal to $S$. We can't say that $\nabla f = \nabla g$ because the gradients might have different magnitudes, or one might point in the opposite direction; but we don't care about the exact direction or the magnitudes of the gradients, we just care about the orthogonality condition. So the sought point $(x_0,y_0)$ satisfies $\nabla f(x_0,y_0) = \lambda \nabla g(x_0,y_0)$ for some constant $\lambda$, and it also satisfies $g(x_0,y_0) = 0$.

Now we will do a trick: we will artificially construct a function $L(x,y,\lambda)$ which, when we equate its gradient to 0, gives us the two conditions above. The first condition is $\nabla f - \lambda \nabla g = 0$, so one function that spits out this condition when its gradient with respect to $x$ and $y$ is equated to 0 is $L(x,y, \lambda) = f(x,y) - \lambda g(x,y)$. And as a freebie, the third component of the $\nabla L$ vector (the derivative with respect to $\lambda$), when equated to 0, gives the second condition $g(x,y) = 0$ – magic!

So the maximal point of $f(x,y)$ along $g(x,y)=0$ is one of the points where $\nabla L(x,y, \lambda) = 0$. There might be several such points, and some will be local minima and some will be local maxima, but we can easily evaluate $f$ at each one and pick the maximum.
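
To make this concrete, here is a minimal sketch with sympy; the example problem (maximise $x+y$ on the unit circle $x^2+y^2=1$) is just an illustrative choice, not something from the discussion above. We build $L$, set its gradient to zero, and compare $f$ at the stationary points.

```python
# Illustrative Lagrange multiplier example: maximise x + y subject to x^2 + y^2 = 1.
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
f = x + y
g = x**2 + y**2 - 1                  # constraint g(x, y) = 0

L = f - lam * g                      # the Lagrangian
stationary = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)

# Evaluate f at each stationary point and pick the maximum.
best = max(stationary, key=lambda s: f.subs(s))
print(best, f.subs(best))            # x = y = 1/sqrt(2), f = sqrt(2)
```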

Multiple constraints


Now suppose we are maximising $f(x,y,z)$ subject to $g_1(x,y,z) = g_2(x,y,z) = 0$.

Things are more interesting now. For one, instead of one level curve we now have two level surfaces for $g_1,g_2$ respectively. Their gradients are still vectors, but they won't generally coincide. The maximum of $f$ doesn't necessarily lie at a point where the gradient of either $g_1$ or $g_2$ is 0.

It's difficult to reason about this setup in the abstract, so let's make this concrete. Picture the level surfaces $g_1=0, g_2=0$ as spheres in space, and imagine the two spheres intersect at the circle $g_1=g_2=0$, like two overlapping soap bubbles. Label the circle $S$.

Consider the gradients of $g_1,g_2$ at a point $p$ on $S$. The gradient of $g_1$ is a vector based at $p$ orthogonal to the surface of the first sphere, and the gradient of $g_2$ is a vector based at $p$ orthogonal to the second sphere. Both gradients are orthogonal to $S$ at $p$, and moreover, the hyperplane that passes through $p$ and contains both vectors is orthogonal to $S$ at $p$. In visual terms, if we are an ant walking along $S$, the hyperplane looks like a flat wall at $p$ that impedes our progress. Think hard about this before proceeding.

Since the hyperplane is orthogonal to $S$ at $p$, any vector $v$ based at $p$ lying on this hyperplane is also orthogonal to $S$ at $p$ (but note that it probably won't be orthogonal to either sphere!). The hyperplane is precisely the span of the two vectors, so $v$ is a linear combination of $\nabla g_1, \nabla g_2$.

Now what can we say about the maximality condition on $f$? If we are at a point on $S$ where the gradient of $f$ is not orthogonal to $S$, then once again we can nudge ourselves in the direction of the gradient to increase $f$. But if we are at a point $p$ where $f$ is maximal, then the gradient of $f$ will be orthogonal to $S$ at $p$. In other words, the gradient of $f$ will be some linear combination of $\nabla g_1, \nabla g_2$, since these two vectors span the hyperplane of vectors orthogonal to $S$ at $p$. Concretely, $\nabla f = \sum_i \lambda_i \nabla g_i$ for constants $\lambda_i$.

Constructing our artificial function $L$ as before, but satisfying the new conditions, we easily see that the right form is $L(x,y,z, \lambda_1, \lambda_2) = f(x,y,z) - \sum_i \lambda_i g_i(x,y,z)$. Equating the gradient $\nabla L$ to 0 gives us exactly the conditions we want: $\nabla f = \sum_i \lambda_i \nabla g_i$, and $g_i=0$ for all $i$.
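
Here is the same sympy sketch adapted to two constraints, using the soap-bubble picture above (two unit spheres intersecting in a circle) and the purely illustrative objective $f = x$.

```python
# Illustrative two-constraint example: maximise x on the circle where two unit spheres meet.
import sympy as sp

x, y, z, l1, l2 = sp.symbols("x y z l1 l2", real=True)
f = x
g1 = x**2 + y**2 + z**2 - 1            # first sphere
g2 = x**2 + y**2 + (z - 1)**2 - 1      # second sphere

L = f - l1 * g1 - l2 * g2
sols = sp.solve([sp.diff(L, v) for v in (x, y, z, l1, l2)], [x, y, z, l1, l2], dict=True)

best = max(sols, key=lambda s: f.subs(s))
print(best[x], best[y], best[z])       # sqrt(3)/2, 0, 1/2
```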

Sunday, July 25, 2021

Notes on ordinary least squares

Vector and matrix derivatives

These are definitions. They are all intuitive. (I've picked one convention for the definitions. The other convention gives all results transposed.)

Taking the derivative of a scalar by a column vector is just taking the grad:

$$\frac{\mathrm{d}a}{\mathrm{d} \bf x} = \nabla a = \left[\frac{\partial a}{\partial x_1}\ \frac{\partial a}{\partial x_2} \dots \ \frac{\partial a }{\partial x_n}\right]^\top$$

Taking the derivative of a vector by a scalar is componentwise differentiation. The result is a row vector by convention.

$$\frac{\mathrm{d}\bf{x}}{\mathrm{d}a} = \left[\frac{\mathrm{d}x_1}{\mathrm{d}a}\ \frac{\mathrm{d}x_2}{\mathrm{d}a}\ \dots\frac{\mathrm{d}x_n}{\mathrm{d}a}\right]$$

These basic derivatives are easily obtained from the definitions:

$$\frac{\mathrm{d}}{\mathrm{d}\bf x}\mathbf{x}^\top b = b$$

$$\frac{\mathrm{d}}{\mathrm{d}\bf x}\mathbf{x}^\top \mathbf x = 2\mathbf x$$

$$\frac{\mathrm{d}}{\mathrm{d}\bf x}\mathbf{x}^\top B \mathbf x = (B + B^\top)\mathbf x$$

which equals $2B\mathbf x$ when $B$ is symmetric (as it will be below, where $B = X^\top X$).

Note: taking the derivative of a row vector by a column vector gives a Jacobian matrix. The denominator of the derivative ranges across rows and the numerator ranges across columns. This is consistent with the two definitions above.

$$\frac{\mathrm{d} \mathbf y}{\mathrm{d}\bf x} = \begin{pmatrix}\frac{\mathrm d y_1}{\mathrm d x_1} & \dots & \frac{\mathrm d y_n}{\mathrm d x_1} \\ \vdots & \ddots  & \vdots\\ \frac{\mathrm d y_1}{\mathrm d x_n} & \dots & \frac{\mathrm d y_n}{\mathrm d x_n} \end{pmatrix}$$

This yields an additional useful derivative, but we won't use it:

$$\frac{\mathrm{d}}{\mathrm{d}\bf x}\mathbf{x}^\top B = B$$

Also note that $\bf x^\top y = y^\top x$ and these are both scalars. Same goes for $\mathbf{x}^\top A \mathbf{y} = \mathbf{y}^\top A^\top \mathbf{x}$. But $\bf xy^\top \neq yx^\top$; both sides are matrices and the rhs is the transpose of the lhs.
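
A quick numerical sanity check of these identities (including the symmetric-$B$ caveat) using central finite differences; the random test vectors and matrices are arbitrary.

```python
# Verify the vector derivative identities numerically with central differences.
import numpy as np

rng = np.random.default_rng(0)
n = 4
x = rng.normal(size=n)
b = rng.normal(size=n)
B = rng.normal(size=(n, n))
B_sym = B + B.T                          # a symmetric matrix

def num_grad(f, x, h=1e-6):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

assert np.allclose(num_grad(lambda v: v @ b, x), b)                      # d/dx x^T b = b
assert np.allclose(num_grad(lambda v: v @ v, x), 2 * x)                  # d/dx x^T x = 2x
assert np.allclose(num_grad(lambda v: v @ B_sym @ v, x), 2 * B_sym @ x)  # symmetric B
assert np.allclose(num_grad(lambda v: v @ B @ v, x), (B + B.T) @ x)      # general B
```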

Algebra

We have a matrix of features $X$ whose first column is 1s. We have a column vector of responses $y$. We wish to find a column vector of coefficients $\beta$ (the first entry of which represents the constant term) such that

$$X\beta \approx y$$

Specifically we wish to minimise the sum of squares of elements in the vector $y-X\beta$. Fortunately, sum of squares is simply

$$\begin{align*}(y-X\beta)^\top (y-X\beta) &= y^\top y - y^\top X\beta - \beta^\top X^\top y + \beta^\top X^\top X \beta\\ &= y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X \beta \end{align*}$$

and we wish to minimise this quantity over $\beta$. It is quadratic in $\beta$ so it has one critical point, and the matrix $X^\top X$ of the quadratic term is positive semidefinite (positive definite when $X$ has full column rank), so the critical point is a minimum. Taking the derivative w.r.t. $\beta$ using the rules above and setting equal to 0 gives

$$-2X^\top y + 2X^\top X  \beta= 0$$

Rearranging gives the normal equations $X^\top X \beta = X^\top y$, and hence (assuming $X^\top X$ is invertible)

$$\beta = (X^\top X)^{-1}X^\top y$$
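
A minimal numpy sketch of the algebra above, on made-up data: solve the normal equations directly and compare with numpy's least-squares routine. (Solving $X^\top X\beta = X^\top y$ with a linear solver is generally preferable to forming the inverse explicitly.)

```python
# Fit beta via the normal equations and compare with numpy's lstsq on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # first column is 1s
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

beta_normal = np.linalg.solve(X.T @ X, X.T @ y)     # solve (X^T X) beta = X^T y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # numpy's least-squares solver
print(beta_normal, beta_lstsq)                      # essentially identical
```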

Geometric derivation

Minimising the sum of squares $\|X \beta - y\|^2$ corresponds geometrically to choosing $\beta$ to minimise the Euclidean distance between two vectors. The $y$ vector is fixed. $\beta$ allows us to choose a linear combination of the columns of $X$ as our other vector. This other vector lies in the column space of $X$ (i.e. its image when interpreted as a linear map), and this column space generally won't contain $y$ (if it did, that would mean $X$'s features can perfectly recreate the response for all samples).

The minimum distance between a vector and a subspace (picture a plane) is attained at the orthogonal projection of the vector onto that subspace. This point is $X\beta$ (for the optimal $\beta$), and so the residual vector $y-X\beta$ is orthogonal to the column space. This implies that $(y-X\beta) \cdot Xv$ is 0 for all vectors $v$ (since every vector of the form $Xv$ lies in the column space), i.e. $(y-X\beta)^\top Xv = 0$ for all $v$. Since this holds for all $v$, then indeed $(y-X\beta)^\top X = 0$. This rearranges to give the normal equation.

$\hat y = X\beta = X(X^\top X)^{-1}X^\top y$ is the projection of $y$ onto the column space of $X$, and more generally the matrix $H = X(X^\top X)^{-1}X^\top$ projects a response vector to its corresponding prediction vector. It is known as the "hat matrix" since it "puts the hat on y", and it is a projection (i.e. idempotent) operator: $H^2 = H$.

Since $Hy=\hat y$, the diagonal entry $H_{ii}$ gives the weighting of the response $y_i$ in the prediction $\hat y_i$, compared to the weighting of other responses $y_j$. This is known as the leverage of the $i$th observation and is a number between 0 and 1 (follows from the idempotency and symmetry conditions of the hat matrix, see Wikipedia). It tells us: for this prediction on this sample, how much did the observed response of the sample play a role, compared to the responses on other samples? In a linear regression model, points at extreme $x$ values (which are not necessarily outliers!) have high leverage. If these points are outliers, it can cause problems. See https://online.stat.psu.edu/stat462/node/170/ 
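
A small numpy sketch of the hat matrix and leverages on made-up data with one extreme $x$ value.

```python
# Hat matrix, idempotency, and leverages for a simple regression with one extreme x.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(size=29), [8.0]])   # one extreme x value
X = np.column_stack([np.ones(30), x])

H = X @ np.linalg.solve(X.T @ X, X.T)              # H = X (X^T X)^{-1} X^T
print(np.allclose(H @ H, H))                       # True: H is idempotent
leverages = np.diag(H)                             # each lies in [0, 1]
print(leverages.sum())                             # trace(H) = number of coefficients = 2
print(leverages[-1], leverages[:-1].mean())        # the extreme point has much higher leverage
```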

Statistics

First, the variance of a random vector $x$ is defined to be the variance-covariance matrix $\text{Var}(x) = E[(x-E[x])(x-E[x])^\top]$.

Fix our dataset $X$ (i.e. it is a constant, not a random matrix). We suppose our dataset was generated by the model $y = X\beta + \epsilon$ where $y$ is a random vector, $\beta$ is an unknown parameter and $\epsilon \sim N(0,\sigma^2 I_n)$ is a random vector of iid errors with parameter $\sigma$. This encodes the usual regression assumptions of normal iid (implying homoscedastic) errors.

Now let $\hat \beta = (X^\top X)^{-1}X^\top y$ be an estimator of $\beta$. We would like to know its bias and its variance.

First, substitute $y$ into our expression for $\hat \beta$ to give (after expanding and simplifying) 

$$\hat \beta = \beta + (X^\top X)^{-1}X^\top \epsilon$$

The only random variable here is $\epsilon$ which has mean 0, so

$$E[\hat \beta] = \beta$$

meaning our estimator is unbiased. We now compute the variance-covariance matrix of $\hat \beta$, which will be 

$$\begin{align*}\text{Var}(\hat \beta) &=E[(\hat \beta - \beta)(\hat \beta - \beta)^\top]  \\&= E[\left((X^\top X)^{-1}X^\top \epsilon\right)\left((X^\top X)^{-1} X^\top \epsilon\right)^\top]\\&= E[(X^\top X)^{-1}X^\top \epsilon \epsilon^\top X (X^\top X)^{-1}]\\&= (X^\top X)^{-1}X^\top E[\epsilon \epsilon^\top] X (X^\top X)^{-1}\\\end{align*}$$

Now $E[\epsilon \epsilon^\top]$ is a component-wise expected value of the matrix $\epsilon \epsilon^\top$. For all off-diagonal entries we have $E[\epsilon_i \epsilon_j] = E[\epsilon_i]E[\epsilon_j] = 0$ by independence. On the diagonal we have entries $E[\epsilon_i^2]$. Now $\epsilon_i \sim N(0, \sigma^2)$, therefore $\sigma^2 = \text{Var}(\epsilon_i) = E[\epsilon_i^2] - E[\epsilon_i]^2$, hence $E[\epsilon_i^2] = \sigma^2$ (well isn't that a neat little trick!). So $E[\epsilon \epsilon^\top] = \sigma^2 I_n$, a multiple of the identity, which hence commutes in multiplication. [N.B. you can also simply observe that $E[\epsilon \epsilon^\top]$ is the variance-covariance matrix of $\epsilon$ (since $\epsilon$ has mean 0), which is $\sigma^2 I_n$ by assumption, simplifying the argument above.] Continuing,

$$\begin{align*} \text{Var}(\hat \beta) &= (X^\top X)^{-1}X^\top X \sigma^2 (X^\top X)^{-1}\\ &= \sigma^2 (X^\top X)^{-1} \end{align*}$$

What is this saying? The variance of a parameter estimate reflects our uncertainty about the value of that parameter; the parameter itself is a constant. Our estimator is a random vector and this is its covariance structure. All the covariances in the matrix are proportional to $\sigma^2$, which makes intuitive sense: the variance of the slope of the line of best fit is naturally proportional to the variance of the errors $\sigma^2$. But the covariances are also based on the inverse of an expression in $X$. The more samples contained in $X$, the larger the values of $X^\top X$, and so the smaller the inverse term in the covariance. This corresponds to the fact that the more samples we have, the more information (and hence the more certainty) we have about the parameters $\beta$. All the uncertainty in $\beta$ comes from us having to infer its value from our limited data. In the extreme case where the number of samples goes to infinity, there is no uncertainty left, and our $\beta$ can be read straight out of the data.

On the diagonal of this matrix are the variances of the regression coefficients $\hat \beta_i$. These are variances of the sampling distribution of the $\hat \beta_i$. We can take the square root of these to get the standard errors for the coefs.
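
A Monte Carlo sketch (on simulated data, with arbitrary choices of $X$, $\beta$ and $\sigma$) checking that the empirical covariance of $\hat\beta$ is close to $\sigma^2(X^\top X)^{-1}$.

```python
# Simulate many datasets from a fixed design and compare Var(beta_hat) with theory.
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 200, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # fixed design
beta = np.array([1.0, 2.0])

estimates = []
for _ in range(10_000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))
estimates = np.array(estimates)

print(np.cov(estimates.T))                  # empirical covariance of beta_hat
print(sigma**2 * np.linalg.inv(X.T @ X))    # theoretical sigma^2 (X^T X)^{-1}
```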

We might also ask about the consistency of $\hat \beta$, i.e. whether $\hat \beta \to_p \beta$ as the dataset grows. This is the formal posing of the extreme case above. Answering the question is a bit tricky to do in the general case, and it involves reasoning formally about the distribution of $X$, which we have thus far avoided. $\hat \beta$ is indeed consistent, and the argument for this in short is that in the expression $\hat \beta = \beta + (X^\top X)^{-1}X^\top \epsilon$, the size of the $X^\top \epsilon$ term can be cleverly bounded by the law of large numbers, using the independence of the samples and the errors, as the randomly sampled $X$ grows large. See 9.1 of http://cameron.econ.ucdavis.edu/e240a/asymptotic.pdf .

Simple OLS

Let $$X = \begin{pmatrix}1 & x_1 \\ \vdots & \vdots\\ 1 & x_n\end{pmatrix}, \quad y=(y_1\ \dots \ y_n)^\top, \quad \beta = (\beta_0 \ \beta_1)^\top$$

and write $\bar x = \frac 1 n \sum_i x_i$ and likewise for other vectors. (the $\beta$s should have hats on them but we are lazy now)

Then $$X^\top X = \begin{pmatrix}n & \sum_i x_i \\ \sum_i x_i &\sum_i x_i^2\end{pmatrix} = \begin{pmatrix}n & n\bar x \\ n \bar x & n\overline{x^2}\end{pmatrix}$$

$$(X^\top X)^{-1} = \frac{1}{n^2 \overline{x^2} - n^2 \bar x ^2} \begin{pmatrix}n\overline{x^2} & -n \bar x \\ -n \bar x & n\end{pmatrix} = \frac{1}{n(\overline{x^2} - \bar x^2)} \begin{pmatrix}\overline{x^2} & -\bar x \\ -\bar x & 1\end{pmatrix}$$

$$X^\top y = \begin{pmatrix}\sum_i y_i \\ \sum_i x_i y_i\end{pmatrix} = \begin{pmatrix}n\bar y \\ n \overline{xy}\end{pmatrix}$$

Thus

$$\beta = \frac{1}{\overline{x^2} - \bar x^2} \begin{pmatrix}\overline{x^2}\bar y - \bar x \overline{xy} \\ -\bar x \bar y + \overline{xy} \end{pmatrix}$$

So the slope is $$\beta_1 = \frac{\overline{xy} - \bar x \bar y}{\overline{x^2} - \bar{x}^2} = \frac{\text{Cov}(x,y)}{\text{Var}(x)}$$

Recalling that $\text{Corr}(x,y) = \frac{\text{Cov}(x,y)}{\sqrt{\text{Var}(x)\text{Var}(y)}}$, and letting $\sigma_x,\sigma_y$ be the standard deviations of $x,y$, we have

$$\beta_1 = \text{Corr}(x,y) \frac{\sigma_y}{\sigma_x}$$

which is an intuitive way of understanding the line of best fit in simple regression. The slope shows the correlation between the independent and the dependent variable, scaled appropriately by each one's standard deviation so it's in the right units. $\sigma_y, \sigma_x$ go in the numerator, denominator respectively to match the rise/run on the plot.
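
A quick numpy check of these identities on simulated data (the data-generating slope of 3 is arbitrary).

```python
# Check slope = Cov(x,y)/Var(x) = Corr(x,y) * sigma_y / sigma_x against np.polyfit.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 3.0 * x + rng.normal(size=500)

cov_xy = np.mean(x * y) - x.mean() * y.mean()         # population-style covariance
var_x = np.mean(x**2) - x.mean()**2
slope = cov_xy / var_x

corr = np.corrcoef(x, y)[0, 1]
slope_via_corr = corr * y.std() / x.std()             # Corr(x,y) * sigma_y / sigma_x

print(slope, slope_via_corr, np.polyfit(x, y, 1)[0])  # all three agree
```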

Saturday, July 24, 2021

Notes on entropy

A measure of the uncertainty of a distribution. What does this mean? When we sample from the distribution, how certain can we be of what we'll get?

Take a discrete probability distribution over outcomes A, B, C. Suppose it's 0.9, 0.05, 0.05. Then we are pretty certain that when we sample from it we'll get A. But if the distribution is 0.33, 0.33, 0.33, then we have very little certainty.

How to quantify this? Suppose we were encoding a message whose components were sampled from the distribution. We'd want to use short encodings for common components and longer encodings for rare components. The entropy of the distribution is the expected length of a component in the most efficient possible binary prefix-free encoding. Since "expected length" is a bit hard to conceptualise (since it involves two concepts, the probability distribution and the encoding), it might be easier to think of the entropy as the inverse concept to the *density* or the *compressibility* of a distribution. Low entropy means we can transmit lots of samples in few bits. High entropy means it takes lots of bits to transmit samples.

Indeed it can be shown that if a component is sampled with probability p, then its optimal length is $\log_2(1/p) = -\log_2(p)$. Let's check intuition:

  • if a component is sampled w/ prob 1, then we don't encode it at all, because there's no uncertainty in the distribution. We know what the message will be without even receiving it.
  • if a component is sampled w/ prob 0.5, it's pretty likely; we should probably spend $\log_2(2)$ = 1 bit on it.
  • if a component is very unlikely, say probability 0.01, we are happy to spend a lot of bits on it because we'll almost never have to transmit it. The number of bits is inversely proportional to the prob. But we don't want to spend 1/0.01 = 100 bits on it, that would be absurd, and misses the tree structure that we have. We do want to spend $\log_2(1/0.01) \approx 7$ bits.

This is the role played by the negative log. Note that negative log just comes from a log law applied to $\log_2(1/p)$ which is the more intuitive formula, because it captures the inverse proportionality of length to probability, as well as the $\log_2$ that comes from the binary tree structure of prefix codes. It transforms probabilities into suggested encoding lengths. 

To compute the average length of a code, we use the usual expected value formula over samples $x$ from the distribution, to give the formula for entropy of $p$:

$$H(p) = -\sum_x p(x) \log_2 p(x)$$
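
A minimal sketch computing the entropy of the two example distributions from earlier.

```python
# Shannon entropy in bits for the 0.9/0.05/0.05 and uniform examples.
import numpy as np

def entropy(p):
    """Shannon entropy in bits; 0 * log(0) is treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

print(entropy([0.9, 0.05, 0.05]))   # ~0.57 bits: we're fairly sure we'll see A
print(entropy([1/3, 1/3, 1/3]))     # ~1.58 bits: maximum uncertainty over 3 outcomes
```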

Cross entropy, KL divergence, mutual information

$$H(p,q) = -\sum_x q(x) \log_2 p(x)$$

What could this correspond to? Suppose we have two probability distributions over the same space. We come up with an optimal encoding for one space - but then we need to use that encoding for the other space. How many bits on average will we need for the expected component? I.e. what we are doing is randomly sampling a component from distribution $q$ and determining its length under the encoding that we came up with for distribution $p$. If it costs us more to use longer encodings, then cross-entropy is the cost of encoding a message from $Q$ under the optimal encoding scheme of $P$. If we are encoding $P$, then there is no additional cost, so $H(p,p) = H(p)$. N.B. Cross-entropy is not symmetric!

Why do we care? Cross-entropy can tell us how different two distributions are. If $p, q$ are very different, then we expect $H(q,p)$ (the cost of encoding $p$'s messages with $q$'s code) to be very different from $H(p)$. Note that $H(p)$ and $H(q)$ might be exactly the same (e.g. because they assign the same probability values to their outcomes, just to different outcomes: p might assign 0.5, 0.25, 0.25 whereas q assigns 0.25, 0.5, 0.25), but still we'll see that the cross-entropy $H(q,p)$ is higher than $H(p)$. Indeed, the difference $H(q,p) - H(p)$ is the Kullback-Leibler divergence, expressing how far a probability distribution $q$ is from a reference distribution $p$. KL can be interpreted as the *additional cost* of encoding messages from $P$ under the optimal encoding for $Q$, compared to encoding them under $P$'s own optimal encoding.

$$D_{KL}(P \| Q) = H(Q,P) - H(P) = -\sum_x P(x) \log_2 Q(x) + \sum_x P(x)\log_2 P(x) = -\sum_x P(x) \log_2\frac{Q(x)}{P(x)}$$

Why do we care? As seen previously, we often want to compare two distributions in ML. For example, to evaluate how a classifier performed on a single instance, we can compute the cross-entropy of the ground-truth distribution of labels for that instance (a one-hot encoding consisting of 0's and 1's) with the output of the activation function (e.g. a softmax), which is a fuzzier distribution. We precisely want to quantify the distance between the network's output and the labels.
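
A minimal sketch of cross-entropy and KL divergence, using the convention above ($H(p,q)$: messages drawn from $q$, encoded with the code that is optimal for $p$) and the 0.5/0.25/0.25 example.

```python
# Cross-entropy and KL divergence for the swapped-probability example above.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

def cross_entropy(code_dist, source_dist):
    """H(code_dist, source_dist): source's messages encoded with code_dist's optimal code."""
    code_dist = np.asarray(code_dist, dtype=float)
    source_dist = np.asarray(source_dist, dtype=float)
    return -np.sum(source_dist * np.log2(code_dist))

def kl(p, q):
    """D_KL(P || Q) = H(Q, P) - H(P)."""
    return cross_entropy(q, p) - entropy(p)

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.25, 0.5, 0.25])
print(entropy(p), entropy(q))    # equal: 1.5 bits each
print(cross_entropy(q, p))       # 1.75 bits: P's messages under Q's code
print(kl(p, q))                  # 0.25 bits of additional cost
```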

There is also a concept of *mutual information* between r.v.s $X,Y$, where $I(X;Y) = D_{KL}(P_{X,Y} \| P_X \otimes P_Y)$: the additional cost of encoding $(X,Y)$ pairs as if they were independent random variables, when in fact their joint distribution may differ from the product of the marginals.
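
A minimal sketch of mutual information for a small made-up joint distribution.

```python
# Mutual information of a 2x2 joint distribution, computed as a KL divergence.
import numpy as np

joint = np.array([[0.4, 0.1],      # P(X=x, Y=y)
                  [0.1, 0.4]])
px = joint.sum(axis=1)             # marginal of X
py = joint.sum(axis=0)             # marginal of Y
indep = np.outer(px, py)           # product of marginals

mi = np.sum(joint * np.log2(joint / indep))   # D_KL(P_XY || P_X x P_Y)
print(mi)                                     # ~0.278 bits shared between X and Y
```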

Summary: cross-entropy is a natural way to compare distributions. It measures in appropriate units, i.e. "multiplicative" units (or log of multiplicative units, really) - e.g. if one distribution gives p=1 and the other gives p=0.01, the difference in probability is huge, even if the absolute difference is only 0.99. And in fact this is penalised as $\log_2(100) \approx 6.6$. If one distribution gave p=1 and the other gave p=0.001, this is penalised as $\log_2(1000) \approx 10$. Note that this is quite different to something like integrating the difference between the curves, which doesn't account for how much worse a p=0.0001 prediction is compared to a p=0.01 prediction when the label is 1.

Entropy of a probability distribution measures the length (or "cost", if we are paying per bit) of a message encoded in the optimal encoding for that distribution.

Cross-entropy measures the length (cost) of encoding a message from one distribution using the optimal encoding from another.

Kullback-Leibler divergence measures the *additional cost* of encoding a message from one distribution using the optimal encoding for another, compared to encoding that same message using its own distribution's optimal encoding.

Mutual information (MI) measures the information shared between two r.v.s by looking at the additional cost of encoding a message from their joint distribution as if the variables were independent (i.e. using the code optimised for the product of marginals), compared to encoding it using the code optimised for the true joint distribution.

Extra info: $h(x) = \log_2(1/p(x)) = -\log_2(p(x))$ is the *information* of event x, which we can think of as the "surprise" of x. If an unlikely event happens, this carries lots of information. So entropy is just the expected information of a distribution. "Expected surprise", or "expected uncertainty".


Notes on rejection sampling and importance sampling

Scenario: we know a distribution in terms of its pdf $f(x)$ and we wish to sample from it. 

Sampling from a pdf can be difficult even when we know the pdf formula. Given the closed form expression of the normal distribution pdf for example, there is no obvious way to generate samples from the distribution.

Approach 1: inverse transform sampling


From $f$ get the cdf $F$ and then invert it to $F^{-1}$. Then sample $U(0,1)$ and pipe into $F^{-1}$ to get values on the x-axis that are distributed as $f$.

Intuition: think of the cdf in your head. Pick values uniformly 0-1 on the y-axis, walk along a horizontal line until you hit the cdf, and take the x-coord.

This works well if you can compute $F^{-1}$. But this is often difficult, e.g. for the normal distribution.
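
A minimal sketch for a case where $F^{-1}$ is available in closed form: the exponential distribution, where $F(x) = 1 - e^{-\lambda x}$ inverts to $F^{-1}(u) = -\ln(1-u)/\lambda$.

```python
# Inverse transform sampling for Exp(lambda).
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
u = rng.uniform(size=100_000)        # uniform samples on (0, 1)
samples = -np.log(1 - u) / lam       # pipe through F^{-1}

print(samples.mean())   # ~1/lambda = 0.5
print(samples.var())    # ~1/lambda^2 = 0.25
```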

Approach 2: rejection sampling


MC algorithm. Let $M$ be the maximum of $f$. Sample x uniformly in some finite interval on the x-axis. For each sample, sample $y \sim U(0,M)$. If $y < f(x)$, keep the x sample, otherwise reject. Pretty intuitive.

The obvious downside is when the desired distribution is very far from uniform, e.g. all its mass lies in a small interval. Then almost all our uniform samples will get rejected.

We can repair this by noting that instead of sampling $x$ uniformly, we can sample according to any *proposal distribution*, and we should choose one whose pdf $g$ looks similar to $f$. Choose a scaling factor $Q$ such that the subgraph of $Qg$ fully contains the subgraph of $f$ ($Q$ fulfils a similar role to $M$). Now repeatedly sample $x$ values from $g$, for each one sample $y \sim U(0,Qg(x))$, and keep the sample if $y < f(x)$.

Think hard about why this works for any proposal distribution -- it's not obvious. For intuition, in the uniform case it is as if we are sampling $(x,y)$ pairs uniformly from a 2D rectangle containing the desired pdf, and keeping the samples that fall in the subgraph. In the non-uniform case, instead of a 2D rectangle, we are instead sampling uniformly from a wonky shape which contains the desired pdf. And sampling uniformly from the wonky shape is equivalent to sampling $x$ coords in proportion to the height of the wonky shape (since short columns of the wonky shape are less likely to receive uniform samples than tall columns), and then sampling uniformly along the vertical line from the bottom to the top of the wonky shape.

An easier way to think about this is: think about throwing darts at a rectangular board containing the pdf, and accepting samples that fall below the pdf. This certainly works, and is equivalent to sampling uniformly on the x and y. But the board does not have to be rectangular; it can be any shape as long as it contains the whole pdf, and if we sample from the shape uniformly and discard the points outside the pdf subgraph, that's equivalent to sampling the pdf subgraph uniformly. So say our wonky shape is itself a pdf (or a positive multiple of a pdf); then sampling uniformly from the wonky shape is equivalent to sampling an x coord according to the pdf of that wonky shape, and then sampling uniformly along the vertical to the top of that wonky shape.

See https://en.wikipedia.org/wiki/Rejection_sampling#Description
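
A minimal sketch of the proposal-distribution version: the target is the standard normal pdf, the proposal is a wider normal, and $Q=2$ is chosen so that $Qg \ge f$ everywhere (all of these choices are just for illustration).

```python
# Rejection sampling: target f = N(0,1) pdf, proposal g = N(0, sd=2) pdf, Q = 2.
import numpy as np

rng = np.random.default_rng(0)

def f(x):                                    # target pdf: standard normal
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def g(x):                                    # proposal pdf: N(0, sd=2)
    return np.exp(-x**2 / 8) / (2 * np.sqrt(2 * np.pi))

Q = 2.0                                      # f(x)/g(x) peaks at 2 when x = 0
x = rng.normal(scale=2.0, size=100_000)      # sample x from the proposal
y = rng.uniform(0, Q * g(x))                 # uniform along the vertical at each x
accepted = x[y < f(x)]

print(len(accepted) / len(x))                # acceptance rate ~ 1/Q = 0.5
print(accepted.mean(), accepted.std())       # ~0 and ~1, as for N(0,1)
```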

Importance sampling


Importance sampling is not actually a sampling method. It is a method of estimating a parameter of a distribution when it is difficult to sample from that distribution, or when we seek a lower-variance estimator than the naive mean estimator.

Suppose we have a rv $X$ distributed according to pdf $p$ and we wish to estimate $E[X]$. If we can sample from $p$, we can estimate $E[X] = \frac1n \sum_{i=1}^n x_i$ where $x_i$ are iid samples, and this is usually a pretty good estimate. But if we *cannot* sample from $p$, it is a bit trickier. One thing we can do is to estimate the integral using MC integration. Sample iid $u_1, \dots, u_n \sim U(a,b)$ where $[a,b]$ covers all (or at least most) of the support of $X$, then use

$$E[X] = \int x p(x) dx \approx \frac{b-a}{n} \sum_{i=1}^n u_i p(u_i)$$

Here we are estimating the area under the curve $xp(x)$ by sampling $x$ coords uniformly in $[a,b]$, taking the average of the heights of the curve at those $x$s to get an average height of the curve, and then multiplying that height by $b-a$ to get the area under the curve.

This is an unbiased estimator, but it doesn't have particularly low variance. Consider the case of a distribution where almost all the mass is around $[-0.01, 0.01]$. If we sample uniformly from $[-1,1]$, then most of our samples are taken in areas which don't have much impact on the estimate of the parameter, i.e. which aren't *important* to the estimation. This estimator would have very high variance. We'd be better off mostly sampling from a tighter interval around 0, by encouraging the sampling of more important values. This is the key.

In the estimate above, each $u_i$ is distributed according to a uniform pdf $q(x) = \frac{1}{b-a}$ supported on $[a,b]$. Rewrite it as

$$E_p[X] = \int xp(x)\,dx = \int x\frac{p(x)}{q(x)}q(x)\,dx = E_q\!\left[X\frac{p(X)}{q(X)}\right] \\= (b-a)E_q[Xp(X)] \approx \frac{b-a}{n} \sum_{i=1}^n u_i p(u_i)$$

This is mathematically how the MC integration works. We are evaluating the expectation of $X$ w.r.t. distribution $p$ by evaluating the expectation of $X\frac{p(X)}{q(X)}$ w.r.t. $q$, and then using the usual mean estimator for that expectation. But we could replace the uniform distribution with any other distribution! If the pdf of that distribution is $q(x)$ and its samples are $y_i$, our estimator is

$$E_p[X] \approx \frac1n \sum_{i=1}^n y_i \frac{p(y_i)}{q(y_i)}$$

So we can choose the distribution of $q$ to be close to the distribution of $p$. As long as $q$ is non-zero on the support of $p$, this works and is unbiased.

Intuitively, we are sampling from $p$ by instead sampling from $q$ and then *reweighting* those samples according to what we know about the distributions of $p$ and $q$. The ratio $p(x)/q(x)$ is called the *sampling ratio* and gives the importance of each sample of $q$ to the distribution $p$.

Even in cases where we can directly sample from the target distribution, importance sampling can give us a lower-variance estimate. For example, to estimate the mean of $N(0,1)$ we can use $\hat \theta = \frac 1 n \sum_{i=1}^n x_i$ where the $x_i \sim N(0,1)$ are iid, satisfying $\text{var}(\hat \theta) = \frac{1}{n}$. If we instead sample from a proposal that concentrates its mass in the regions that contribute most to the integral being estimated (for the mean, the regions where $|x|p(x)$ is large) and reweight, the resulting estimator can have lower variance. But to use importance sampling in this way, we need an a priori idea of where those important regions are.
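
A minimal sketch of both estimators on an illustrative target ($p$ the pdf of $N(2,1)$, whose mean is 2): plain Monte Carlo integration with a uniform proposal, and the reweighted estimator with a non-uniform proposal.

```python
# Estimate E_p[X] for p = N(2,1) via uniform MC integration and via importance sampling.
import numpy as np

rng = np.random.default_rng(0)

def p(x):                                    # target pdf: N(2, 1); true mean is 2
    return np.exp(-(x - 2)**2 / 2) / np.sqrt(2 * np.pi)

n = 100_000

# Uniform proposal on [a, b], which covers almost all of the mass of p.
a, b = -3.0, 7.0
u = rng.uniform(a, b, size=n)
est_uniform = (b - a) / n * np.sum(u * p(u))

# Non-uniform proposal q = N(2, sd=2), reweighting each sample by p(y)/q(y).
def q(x):
    return np.exp(-(x - 2)**2 / 8) / (2 * np.sqrt(2 * np.pi))

y = rng.normal(loc=2.0, scale=2.0, size=n)
est_is = np.mean(y * p(y) / q(y))

print(est_uniform, est_is)                   # both ~2
```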

Effective sample size


We sample 30 elements independently from a population distributed with variance $\sigma^2$. Then the sample size $n$ is 30, and the estimator of the mean $\mu$ has variance $\sigma^2/n$. So the more elements we sample, the more likely it is that the estimator is close to the true mean.

Now suppose 20 of those elements were correlated to each other. Then the estimator of the mean will now have variance bigger than $\sigma^2/n$, meaning we are less confident that the estimator gives us an estimate close to the mean. Another way to think about this is that although our actual sample size was 30, the effective sample size $n_e$ will be somewhat smaller. The ESS $n_e$ is defined as the quantity such that $\sigma^2/n_e$ gives the variance of the estimator.

For instance if all elements of the sample are 100% correlated, then we only really have one datapoint from which to compute the mean, so $n_e=1$.

Rephrasing: given a sample, if all its elements are independent, then we can use the sample mean as a good estimate of the mean. But if the elements are not independent, then we lose confidence in the precision of that mean. The ESS tells us: "when it comes to calculating the variance of the mean estimator, your sample size might be $n$ but *effectively* your sample size is $n_e$".

Kish's ESS


Suppose our sample is independent but weighted. Take a sample {a,b} with weights {1,2}. Then it is as if we actually had three samples {a,b,b} where the last two are 100% correlated. If we were to take $(a+2b)/3$ as our estimator of the weighted mean, the variance of this mean is $\frac 5 9 \sigma^2$, which is quite a bit larger than the $\frac 1 3 \sigma ^2$ that we would get if we had three independent 1-weighted elements in the sample. So the effective sample size is not 3, in fact it's less than 2 (since we only have two distinct elements and we've unbalanced them): it is 9/5 = 1.8.

This value is given by Kish's formula for the ESS: $n_e = \frac{\left(\sum_i w_i\right)^2}{\sum_i w_i^2}$.
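
A one-liner sketch of Kish's formula, checked against the example above.

```python
# Kish's effective sample size for a set of weights.
import numpy as np

def kish_ess(weights):
    w = np.asarray(weights, dtype=float)
    return w.sum()**2 / np.sum(w**2)

print(kish_ess([1, 2]))        # 1.8, as computed above
print(kish_ess([1, 1, 1]))     # 3.0: equal weights give the full sample size
print(kish_ess([1, 0.001]))    # ~1: one weight dominates, so effectively one datapoint
```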


Notes on covariance, contravariance, invariance

These are words used to describe data types (“elevateds”) composed of other data types (“constituents”).

  • Covariance: the subtype relationship of constituents is conserved.
  • Contravariance: the subtype relationship of constituents is reversed.
  • Invariance: no subtype relationship is conserved.

Covariance applies when the elevated only exposes instances of the constituent. e.g. when the elevated is a read-only list of constituents, or the elevated returns a constituent. e.g. a ConstList<Cat> is a subtype of ConstList<Animal> because anything that wants to consume an Animal is happy to receive Cats.

Contravariance applies when the elevated only consumes instances of the constituent. e.g. when the elevated is a function of the constituent. e.g. an Action<Animal> is a subtype of an Action<Cat> because anything that wants to send off a Cat is happy to send it to a service receiving Animals.

Invariance applies when the elevated consumes and exposes instances of the constituent. e.g. when the elevated is a mutable list of a constituent. e.g. List<Cat> is not a subtype of List<Animal> because the user might want to take the list and store a Dog in it, and List<Animal> is not a subtype of List<Cat> because the user expects the list to contain cats.

Then function types are contravariant in the inputs and covariant in the outputs. So Cat -> Animal has non-trivial subtypes Animal -> Cat, Cat -> Cat, and Animal -> Animal.

Don’t try to think of this in terms of calls to the function itself. This is just a statement about the function type. When we say Animal -> Animal is a subtype of Cat -> Animal we don’t mean that we can substitute an Animal as the input when a Cat is asked for. We mean that we can substitute the function itself with an instance of the subtype. We are saying that in the expression Animal a = f(Cat), we can replace f with an Animal -> Animal (or indeed an Animal -> Cat or Cat -> Cat) and it’ll still work.
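
A minimal sketch of these rules using Python's static typing, where the read-only Sequence type is covariant and Callable is contravariant in its argument types; the Cat/Animal classes are stand-ins for the examples above. The code runs as-is; the subtyping claims in the comments are what a type checker such as mypy verifies.

```python
# Covariance of Sequence and contravariance of Callable arguments in Python's typing.
from typing import Callable, Sequence

class Animal: ...
class Cat(Animal): ...

def feed_all(animals: Sequence[Animal]) -> None:   # Sequence only exposes its elements
    pass

cats: Sequence[Cat] = (Cat(), Cat())
feed_all(cats)              # OK: Sequence[Cat] is a subtype of Sequence[Animal] (covariant)

def apply_to_cat(f: Callable[[Cat], Animal]) -> Animal:
    return f(Cat())

def animal_to_cat(a: Animal) -> Cat:
    return Cat()

# OK: Callable[[Animal], Cat] is a subtype of Callable[[Cat], Animal]
# (contravariant in the input, covariant in the output).
result = apply_to_cat(animal_to_cat)
```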

Notes on quadratic forms, the Hessian, and positive definiteness


A quadratic form in two variables is a polynomial in two variables with all degrees 2: $f(x,y) = ax^2 + bxy + cy^2$. We can associate a two-variable quadratic form given in this form with the matrix $A = [[a, \frac12 b], [\frac12 b, c]]$ since $f(x,y) = (x,y)^\top A(x,y)$. This generalises naturally: an $n$-variable quadratic form corresponds to a unique symmetric $n\times n$ matrix.

$n$-dimensional quadratic forms are useful because they act as quadratic approximation terms in $n$ dimensions. To understand what this means, consider the Taylor expansion of $f$ around $x$, given by $f(x+c) =f(x) + f'(x)c + \frac12 f''(x)c^2 + \dots$ Each term here captures a property of the local behaviour of $f$. The $f(x)$ term is a constant. The $f'(x)c$ term is a linear approximation without a constant term, since we've already included the constant term in the expansion. The $\frac12 f''(x)c^2$ term is a quadratic approximation without a constant or linear term, since those have already been captured. And so on. The constant term captures only the constant behaviour. The linear term captures only the linear behaviour i.e. the behaviour of the slope. The quadratic term captures only the quadratic behaviour, i.e. the concavity.

How to visualise 2-variable quadratic forms? They are paraboloids, which can be elliptic (imagine a parabola rotated about its vertical, but not necessarily symmetric), hyperbolic (a saddle), or a parabolic cylinder (think of a half pipe).

We are given a function of two variables $f(x,y)$ and wish to approximate it at $f(x+a,y+b)$. The approximation is 

$$f(x+a,y+b) = f(x,y) + f_x(x,y)a + f_y(x,y)b + \tfrac12 f_{xx}(x,y)a^2 + f_{xy}(x,y)ab + \tfrac12 f_{yy}(x,y)b^2 + \dots$$

This is intuitive. We capture the constant behaviour first. Then for the linear behaviour, we capture the x and y behaviours separately and combine them linearly to describe the plane approximation. For the quadratic behaviour, we can't just capture x and y and combine them linearly, since the quadratic surface is more complex than a linear plane, and can't just be captured by combining two independent dimensions. Instead we capture the behaviour of the x and y directions separately and also of combining the x and y together.

Rewriting in vector form, we approximate $f(v+c)$ around $f(v)$ by

$$f(v+c) = f(v) + \nabla f(v)\cdot c + \tfrac12 c^\top H_f(v)c + \dots$$

where the Hessian is $H_f = [[f_{xx}, f_{xy}], [f_{yx}, f_{yy}]]$ and we evaluate this at $v$. The Hessian therefore represents the component of our approximation corresponding to "concavity information", in the same way that the grad captures "gradient/slope information". Neat.


Since partials commute for $f$ satisfying basic differentiability conditions, the Hessian is symmetric. Suppose we are at a critical point $v$ (so $\nabla f(v) = 0$) and the evaluated Hessian matrix is $[[1,-2], [-2,1]]$. What information about the concavity can we read off this matrix? $f_{xx}(v)$ and $f_{yy}(v)$ are both 1, so the double derivatives in the x and y directions are both positive. Is this enough to conclude that we are at a local minimum, and that whatever direction we head in will take us "upwards" (i.e. bending away from the tangent plane)?

Real symmetric matrices have real orthogonal eigenvectors. For our example Hessian, there is eigenvector $[1,1]$ with eigenvalue $-1$ and eigenvector $[-1,1]$ with eigenvalue $3$. Indeed we can diagonalise this Hessian using these eigenvectors/values. What it tells us is that if we head in the direction of the first eigenvector, the double derivative in this direction is -1, and if we head in the direction of the second eigenvector the double derivative is 3. So although the x and y directions both have positive double derivatives, we have found a direction $[1,1]$ where we don't have positive concavity. So we are at a saddle point, not a minimum.

Is there any other information about the concavity that's not present in its eigenvectors and eigenvalues? No! The eigenvectors and eigenvalues will fully describe the quadratic approximation at the point. Specifically the eigenvectors are the axes and the eigenvalues are the scalings. If we take level sets of this surface, we get hyperbolas with orthogonal axes, and the eigenvectors and eigenvalues give the axes positions and scalings.
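
A quick numpy check of the example Hessian's eigenvalues and eigenvectors, and of the second derivative along each eigenvector direction.

```python
# Eigen-analysis of the example Hessian [[1, -2], [-2, 1]].
import numpy as np

H = np.array([[1.0, -2.0],
              [-2.0, 1.0]])
vals, vecs = np.linalg.eigh(H)       # eigh: eigendecomposition for symmetric matrices
print(vals)                          # [-1.  3.]
print(vecs)                          # columns proportional to [1, 1] and [-1, 1] (up to sign)

# Second derivative of the quadratic form along each (unit) eigenvector direction:
for v, lam in zip(vecs.T, vals):
    print(v @ H @ v, lam)            # equal: -1 along [1,1]/sqrt(2), 3 along [-1,1]/sqrt(2)
```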

In this example the Hessian at $v$ had a positive and a negative eigenvalue, so we concluded that $v$ is a saddle. Generalising: if the Hessian has two positive eigenvalues, then we are at a local minimum (i.e. the quadratic term in the Taylor series is an elliptic paraboloid), and any direction in which we move will take us upward. And if the Hessian has two negative eigenvalues, then we are at a local maximum.

And if we are working in more than two variables? Then we are at a local minimum if all the Hessian's eigenvalues are positive, and a local maximum if all the Hessian's eigenvalues are negative. But these are precisely the conditions for positive/negative definiteness!

So: positive definiteness is a condition on a matrix which means that its corresponding quadratic form is convex. Since the Hessian of a function at a point captures all the information about the function's convexity at that point, then the Hessian being positive definite at a point means the function is convex at that point. And if the Hessian is not positive or negative definite, then we can read off its eigenvectors and values to see the directions in which the function moves up or down.

And how does positive definiteness relate to quadratic optimisation? Take the function $f(x) = \tfrac 12 x^\top A x + x^\top b + c$. The second derivative (Hessian) of this function is $A$ everywhere. If $A$ is positive definite, then the second derivative is "positive" everywhere (i.e. from any starting point, whatever direction we go in will lead to the surface bending away from its tangent plane), and we know this from the eigenvalues all being positive. But this precisely means that $f$ is convex! This is the relationship between convex optimisation and positive (semi)definiteness.