Wednesday, July 28, 2021
Intuition for Lagrange multipliers with multiple constraints
Sunday, July 25, 2021
Notes on ordinary least squares
Vector and matrix derivatives
These are definitions. They are all intuitive. (I've picked one convention for the definitions. The other convention gives all results transposed.)
Taking the derivative of a scalar by a column vector is just taking the grad:
$$\frac{\mathrm{d}a}{\mathrm{d} \bf x} = \nabla a = \left[\frac{\partial a}{\partial x_1}\ \frac{\partial a}{\partial x_2} \dots \ \frac{\partial a }{\partial x_n}\right]^\top$$
Taking the derivative of a vector by a scalar is componentwise differentiation. The result is a row vector by convention.
$$\frac{\mathrm{d}\bf{x}}{\mathrm{d}a} = \left[\frac{\mathrm{d}x_1}{\mathrm{d}a}\ \frac{\mathrm{d}x_2}{\mathrm{d}a}\ \dots\frac{\mathrm{d}x_n}{\mathrm{d}a}\right]$$
These basic derivatives are easily obtained from the definitions:
$$\frac{\mathrm{d}}{\mathrm{d}\bf x}\mathbf{x}^\top b = b$$
$$\frac{\mathrm{d}}{\mathrm{d}\bf x}\mathbf{x}^\top \mathbf x = 2\mathbf x$$
$$\frac{\mathrm{d}}{\mathrm{d}\bf x}\mathbf{x}^\top B \mathbf x = 2B\mathbf x$$
(The last identity holds for symmetric $B$; for general $B$ the derivative is $(B + B^\top)\mathbf x$. Below we only apply it to the symmetric matrix $X^\top X$.)
Note: taking the derivative of a row vector by a column vector gives a Jacobian matrix. The denominator of the derivative ranges across rows and the numerator ranges across columns. This is consistent with the two definitions above.$$\frac{\mathrm{d} \mathbf y}{\mathrm{d}\bf x} = \begin{pmatrix}\frac{\partial y_1}{\partial x_1} & \dots & \frac{\partial y_n}{\partial x_1} \\ \vdots & \ddots & \vdots\\ \frac{\partial y_1}{\partial x_n} & \dots & \frac{\partial y_n}{\partial x_n} \end{pmatrix}$$
This yields one more derivative for completeness, though we won't need it below:
$$\frac{\mathrm{d}}{\mathrm{d}\bf x}\mathbf{x}^\top B = B$$
Also note that $\bf x^\top y = y^\top x$ and these are both scalars. Same goes for $\mathbf{x}^\top A \mathbf{y} = \mathbf{y}^\top A^\top \mathbf{x}$. But $\bf xy^\top \neq yx^\top$; both sides are matrices and the rhs is the transpose of the lhs.
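To double-check these identities (including the symmetric-$B$ caveat), here is a minimal NumPy sketch comparing the analytic gradients against central finite differences; the random data and the `num_grad` helper are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
x = rng.normal(size=n)
b = rng.normal(size=n)
B = rng.normal(size=(n, n))  # deliberately not symmetric

def num_grad(f, x, h=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

assert np.allclose(num_grad(lambda v: v @ b, x), b)        # d/dx x^T b = b
assert np.allclose(num_grad(lambda v: v @ v, x), 2 * x)    # d/dx x^T x = 2x
# For general B the gradient is (B + B^T) x, reducing to 2 B x when B is symmetric.
assert np.allclose(num_grad(lambda v: v @ B @ v, x), (B + B.T) @ x)
```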
Algebra
We have a matrix of features $X$ whose first column is 1s. We have a column vector of responses $y$. We wish to find a column vector of coefficients $\beta$ (the first entry of which represents the constant term) such that
$$X\beta \approx y$$
Specifically, we wish to minimise the sum of squares of the elements of the residual vector $y-X\beta$. Fortunately, this sum of squares is simply
$$\begin{align*}(y-X\beta)^\top (y-X\beta) &= y^\top y - y^\top X\beta - \beta^\top X^\top y + \beta^\top X^\top X \beta\\ &= y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X \beta \end{align*}$$
and we wish to minimise this quantity over $\beta$. It is quadratic in $\beta$, and the matrix of the quadratic term, $X^\top X$, is positive definite when $X$ has full column rank, so there is a single critical point and it is a minimum. Taking the derivative w.r.t. $\beta$ using the rules above and setting it equal to 0 gives
$$-2X^\top y + 2X^\top X \beta= 0$$
Rearranging gives the normal equations $X^\top X\beta = X^\top y$, and hence (assuming $X^\top X$ is invertible)
$$\beta = (X^\top X)^{-1}X^\top y$$
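As a sanity check, here is a minimal NumPy sketch (with made-up synthetic data) that solves the normal equations directly and compares the result against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # first column of 1s
beta_true = np.array([2.0, -1.0, 0.5])                          # made-up coefficients
y = X @ beta_true + rng.normal(scale=0.1, size=n)

beta_normal = np.linalg.solve(X.T @ X, X.T @ y)    # solve the normal equations
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_normal, beta_lstsq)
```

(In numerical practice one usually prefers a QR- or SVD-based solver to forming $X^\top X$ explicitly, since the latter squares the condition number; for these notes the normal-equation form is what matters.)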
Geometric derivation
Minimising the sum of squares $\|X \beta - y\|^2$ corresponds geometrically to choosing $\beta$ to minimise the Euclidean distance between two vectors. The $y$ vector is fixed. $\beta$ lets us choose a linear combination of the columns of $X$ as our other vector. This other vector lies in the column space of $X$ (i.e. its image when interpreted as a linear map), and this column space typically won't contain $y$ (if it did, $X$'s features could perfectly recreate the response for every sample).
The point of a subspace (picture a plane) closest to a given vector is the orthogonal projection of that vector onto the subspace. This point is $X\beta$ (for the optimal $\beta$), and so the residual vector $y-X\beta$ is orthogonal to the subspace. This implies that $(y-X\beta) \cdot Xv = 0$ for all vectors $v$ (since every $Xv$ lies in the subspace), i.e. $(y-X\beta)^\top Xv = 0$ for all $v$. Since this holds for all $v$, we have $(y-X\beta)^\top X = 0$. This rearranges to give the normal equations.
$\hat y = X\beta = X(X^\top X)^{-1}X^\top y$ is the projection of $y$ onto the column space of $X$, and more generally the matrix $H = X(X^\top X)^{-1}X^\top$ projects a response vector to its corresponding prediction vector. It is known as the "hat matrix" since it "puts the hat on y", and it is a projection (i.e. idempotent) operator: $H^2 = H$.
Since $Hy=\hat y$, the diagonal entry $H_{ii}$ gives the weighting of the response $y_i$ in the prediction $\hat y_i$, compared to the weighting of other responses $y_j$. This is known as the leverage of the $i$th observation and is a number between 0 and 1 (follows from the idempotency and symmetry conditions of the hat matrix, see Wikipedia). It tells us: for this prediction on this sample, how much did the observed response of the sample play a role, compared to the responses on other samples? In a linear regression model, points at extreme $x$ values (which are not necessarily outliers!) have high leverage. If these points are outliers, it can cause problems. See https://online.stat.psu.edu/stat462/node/170/
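Here is a small NumPy sketch of the hat matrix (again with made-up data): it checks idempotency, that $Hy$ gives the fitted values, and that the leverages on the diagonal lie in $[0,1]$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 2.0 + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)   # made-up data

H = X @ np.linalg.inv(X.T @ X) @ X.T                      # the hat matrix
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

assert np.allclose(H @ H, H)                 # idempotent: H is a projection
assert np.allclose(H @ y, X @ beta_hat)      # H "puts the hat on y"

leverage = np.diag(H)
assert np.all((leverage >= 0) & (leverage <= 1))
print(leverage.max())                        # the highest-leverage observation
```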
Statistics
First, the variance of a random vector $x$ is defined to be the variance-covariance matrix $\text{Var}(x) = E[(x-E[x])(x-E[x])^\top]$.
Fix our dataset $X$ (i.e. treat it as a constant, not a random matrix). We suppose our data were generated by the model $y = X\beta + \epsilon$, where $y$ is a random vector, $\beta$ is an unknown parameter vector and $\epsilon \sim N(0,\sigma^2 I_n)$ is a random vector whose components are iid $N(0, \sigma^2)$. This encodes the usual regression assumptions of normal iid (and hence homoscedastic) errors.
Now let $\hat \beta = (X^\top X)^{-1}X^\top y$ be an estimator of $\beta$. We would like to know its bias and its variance.
First, substitute $y = X\beta + \epsilon$ into our expression for $\hat \beta$ to give (after expanding and simplifying)
$$\hat \beta = \beta + (X^\top X)^{-1}X^\top \epsilon$$
The only random variable here is $\epsilon$ which has mean 0, so
$$E[\hat \beta] = \beta$$
meaning our estimator is unbiased. We now compute the variance-covariance matrix of $\hat \beta$, which will be
$$\begin{align*}\text{Var}(\hat \beta) &=E[(\hat \beta - \beta)(\hat \beta - \beta)^\top] \\&= E[\left((X^\top X)^{-1}X^\top \epsilon\right)\left((X^\top X)^{-1} X^\top \epsilon\right)^\top]\\&= E[(X^\top X)^{-1}X^\top \epsilon \epsilon^\top X (X^\top X)^{-1}]\\&= (X^\top X)^{-1}X^\top E[\epsilon \epsilon^\top] X (X^\top X)^{-1}\\\end{align*}$$
Now $E[\epsilon \epsilon^\top]$ is the component-wise expected value of the matrix $\epsilon \epsilon^\top$. For all off-diagonal entries we have $E[\epsilon_i \epsilon_j] = E[\epsilon_i]E[\epsilon_j] = 0$ by independence. On the diagonal we have entries $E[\epsilon_i^2]$. Now $\epsilon_i \sim N(0, \sigma^2)$, therefore $\sigma^2 = \text{Var}(\epsilon_i) = E[\epsilon_i^2] - E[\epsilon_i]^2$, hence $E[\epsilon_i^2] = \sigma^2$ (well isn't that a neat little trick!). So $E[\epsilon \epsilon^\top] = \sigma^2 I_n$, a scalar multiple of the identity, which commutes with any matrix. [N.B. you can also simply observe that $E[\epsilon \epsilon^\top]$ is the variance-covariance matrix of $\epsilon$ (since $\epsilon$ has mean 0), which shortens the argument above.] Continuing,
$$\begin{align*} \text{Var}(\hat \beta) &= (X^\top X)^{-1}X^\top X \sigma^2 (X^\top X)^{-1}\\ &= \sigma^2 (X^\top X)^{-1} \end{align*}$$
What is this saying? The variance of a parameter estimate reflects our uncertainty about the value of that parameter; the parameter itself is a constant. Our estimator is a random vector and this is its covariance structure. All the covariances in the matrix are proportional to $\sigma^2$, which makes intuitive sense: the variance of the slope of the line of best fit is naturally proportional to the variance of the errors $\sigma^2$. But the covariances are also based on the inverse of an expression in $X$. The more samples contained in $X$, the larger the values of $X^\top X$, and so the smaller the inverse term in the covariance. This corresponds to the fact that the more samples we have, the more information (and hence the more certainty) we have about the parameters $\beta$. All the uncertainty in $\beta$ comes from us having to infer its value from our limited data. In the extreme case where the number of samples goes to infinity, there is no uncertainty left, and our $\beta$ can be read straight out of the data.
On the diagonal of this matrix are the variances of the regression coefficients $\hat \beta_i$. These are variances of the sampling distribution of the $\hat \beta_i$. We can take the square root of these to get the standard errors for the coefs.
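The variance formula can be checked by simulation: keep $X$ fixed, redraw $\epsilon$ many times, and compare the empirical covariance of the resulting $\hat\beta$ draws against $\sigma^2 (X^\top X)^{-1}$. A minimal sketch, with all numbers made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # X is fixed across simulations
beta = np.array([1.0, 2.0])                             # made-up true parameters

draws = []
for _ in range(20000):
    y = X @ beta + rng.normal(scale=sigma, size=n)      # redraw the errors each time
    draws.append(np.linalg.solve(X.T @ X, X.T @ y))
draws = np.array(draws)

empirical = np.cov(draws, rowvar=False)                 # covariance of the beta_hat draws
theoretical = sigma**2 * np.linalg.inv(X.T @ X)
print(empirical)
print(theoretical)                                      # should closely match the above
print(np.sqrt(np.diag(theoretical)))                    # standard errors of the coefficients
```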
We might also ask about the consistency of $\hat \beta$, i.e. whether $\hat \beta \to_p \beta$ as the dataset grows. This is the formal statement of the extreme case above. Answering the question in the general case is a bit tricky, and it involves reasoning formally about the distribution of $X$, which we have thus far avoided. $\hat \beta$ is indeed consistent; the short version of the argument is that in the expression $\hat \beta = \beta + (X^\top X)^{-1}X^\top \epsilon$, the size of the $X^\top \epsilon$ term can be bounded using the law of large numbers, via the independence of the samples and the errors, as the randomly sampled $X$ grows large. See 9.1 of http://cameron.econ.ucdavis.edu/e240a/asymptotic.pdf .
Simple OLS
Let $$X = \begin{pmatrix}1 & x_1 \\ \vdots & \vdots\\ 1 & x_n\end{pmatrix}, \quad y=(y_1\ \dots \ y_n)^\top, \quad \beta = (\beta_0 \ \beta_1)^\top$$
and write $\bar x = \frac 1 n \sum_i x_i$ and likewise for other vectors. (the $\beta$s should have hats on them but we are lazy now)
Then $$X^\top X = \begin{pmatrix}n & \sum_i x_i \\ \sum_i x_i &\sum_i x_i^2\end{pmatrix} = \begin{pmatrix}n & n\bar x \\ n \bar x & n\overline{x^2}\end{pmatrix}$$
$$(X^\top X)^{-1} = \frac{1}{n^2 \overline{x^2} - n^2 \bar x ^2} \begin{pmatrix}n\overline{x^2} & -n \bar x \\ -n \bar x & n\end{pmatrix} = \frac{1}{n(\overline{x^2} - \bar x^2)} \begin{pmatrix}\overline{x^2} & -\bar x \\ -\bar x & 1\end{pmatrix}$$
$$X^\top y = \begin{pmatrix}\sum_i y_i \\ \sum_i x_i y_i\end{pmatrix} = \begin{pmatrix}n\bar y \\ n \overline{xy}\end{pmatrix}$$
Thus
$$\beta = \frac{1}{\overline{x^2} - \bar x^2} \begin{pmatrix}\overline{x^2}\bar y - \bar x \overline{xy} \\ -\bar x \bar y + \overline{xy} \end{pmatrix}$$
So the slope is $$\beta_1 = \frac{\overline{xy} - \bar x \bar y}{\overline{x^2} - \bar{x}^2} = \frac{\text{Cov}(x,y)}{\text{Var}(x)}$$
Recalling that $\text{Corr}(x,y) = \frac{\text{Cov}(x,y)}{\sqrt{\text{Var}(x)\text{Var}(y)}}$, and letting $\sigma_x,\sigma_y$ be the standard deviations of $x,y$, we have
$$\beta_1 = \text{Corr}(x,y) \frac{\sigma_y}{\sigma_x}$$
which is an intuitive way of understanding the line of best fit in simple regression. The slope shows the correlation between the independent and the dependent variable, scaled appropriately by each one's standard deviation so it's in the right units. $\sigma_y, \sigma_x$ go in the numerator, denominator respectively to match the rise/run on the plot.
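A quick numerical check of the two slope formulas, with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 3.0 + 2.0 * x + rng.normal(size=500)

slope_fit = np.polyfit(x, y, 1)[0]                              # OLS slope
slope_cov = np.cov(x, y, bias=True)[0, 1] / np.var(x)           # Cov(x,y) / Var(x)
slope_corr = np.corrcoef(x, y)[0, 1] * np.std(y) / np.std(x)    # Corr(x,y) * sigma_y / sigma_x

assert np.isclose(slope_fit, slope_cov)
assert np.isclose(slope_fit, slope_corr)
```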
Saturday, July 24, 2021
Notes on entropy
A measure of the uncertainty of a distribution. What does this mean? When we sample from the distribution, how certain can we be of what we'll get?
Take a discrete probability distribution over outcomes A, B, C. Suppose it's 0.9, 0.05, 0.05. Then we are pretty certain that when we sample from it we'll get A. But if the distribution is 0.33, 0.33, 0.33, then we have very little certainty.
How to quantify this? Suppose we were encoding a message whose components were sampled from the distribution. We'd want to use short encodings for common components and longer encodings for rare components. The entropy of the distribution is the expected length of a component under the most efficient possible binary prefix-free encoding. Since "expected length" involves two concepts at once (the probability distribution and the encoding), it might be easier to think of entropy as the inverse concept to the *density* or the *compressibility* of a distribution. Low entropy means we can transmit lots of samples in few bits; high entropy means each sample takes many bits to transmit.
Indeed it can be shown that if a component is sampled with probability p, then its optimal length is $\log_2(1/p) = -\log_2(p)$. Let's check intuition:
- if a component is sampled w/ prob 1, then we don't encode it at all, because there's no uncertainty in the distribution. We know what the message will be without even receiving it.
- if a component is sampled w/ prob 0.5, it's pretty likely; we should probably spend $\log_2(2)$ = 1 bit on it.
- if a component is very unlikely, say probability 0.01, we are happy to spend a lot of bits on it because we'll almost never have to transmit it: the rarer the component, the longer its code. But we don't want to spend 1/0.01 = 100 bits on it; that would be absurd and would miss the tree structure we have available. We do want to spend $\log_2(1/0.01) \approx 7$ bits.
This is the role played by the negative log: it transforms probabilities into suggested encoding lengths. The negative sign just comes from a log law applied to $\log_2(1/p)$, which is the more intuitive formula: the reciprocal $1/p$ captures the fact that rarer components get longer codes, and the $\log_2$ comes from the binary tree structure of prefix codes.
To compute the average length of a code, we use the usual expected value formula over samples $x$ from the distribution, to give the formula for entropy of $p$:
$$H(p) = -\sum_x p(x) \log_2 p(x)$$
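A minimal sketch computing the entropy of the two example distributions from earlier (using exactly $1/3$ for the near-uniform case):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))   # assumes no zero probabilities

print(entropy([0.9, 0.05, 0.05]))    # ~0.57 bits: we're fairly sure we'll see A
print(entropy([1/3, 1/3, 1/3]))      # ~1.58 bits = log2(3): maximal uncertainty
```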
Cross entropy, KL divergence, mutual information
$$H(p,q) = -\sum_x q(x) \log_2 p(x)$$
What could this correspond to? Suppose we have two probability distributions over the same space. We come up with an optimal encoding for one distribution, but then we have to use that encoding for the other. How many bits on average will we need per component? That is, we randomly sample a component from distribution $q$ and measure its length under the encoding that we came up with for distribution $p$. Thinking of each bit as having a cost, the cross-entropy is the cost of encoding a message from $Q$ under the optimal encoding scheme of $P$. (Note the argument order used here: the first argument is the distribution the code is optimised for; many texts write the arguments the other way around.) If we are encoding $P$ itself, there is no additional cost, so $H(p,p) = H(p)$. N.B. Cross-entropy is not symmetric!
Why do we care? Cross-entropy can tell us how different two distributions are. If $p, q$ are very different, then we expect $H(p,q)$ to be noticeably larger than the entropy of the distribution we are sampling from. Note that $H(p)$ and $H(q)$ might be exactly the same (e.g. because they assign the same probability values to their outcomes, just to different outcomes: $p$ might assign 0.5, 0.25, 0.25 whereas $q$ assigns 0.25, 0.5, 0.25), but the cross-entropy $H(p,q)$ will still be higher. Indeed, the difference $H(q,p) - H(p)$ is the Kullback-Leibler divergence, expressing the distance of a probability distribution $q$ compared to a reference distribution $p$. KL can be interpreted as the *additional cost* of encoding messages drawn from $P$ under the optimal encoding for $Q$, compared to encoding them under $P$'s own optimal encoding.
$$\begin{align*}D_{KL}(P \| Q) &= H(Q,P) - H(P) \\ &= -\sum_x P(x) \log_2 Q(x) + \sum_x P(x)\log_2 P(x) \\ &= -\sum_x P(x) \log_2\frac{Q(x)}{P(x)}\end{align*}$$
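To make the definitions concrete, here is a small sketch using the convention above ($H(p,q) = -\sum_x q(x)\log_2 p(x)$, first argument = the distribution the code is optimised for) and the 0.5/0.25/0.25 vs 0.25/0.5/0.25 example:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """H(p, q) in the convention above: samples come from q, code is optimal for p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(q * np.log2(p))

def kl(p, q):
    """D_KL(P || Q): extra bits paid to encode samples from p with q's code."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log2(p / q))

p = [0.5, 0.25, 0.25]
q = [0.25, 0.5, 0.25]
print(entropy(p), entropy(q))                        # both 1.5 bits
print(cross_entropy(q, p))                           # H(q, p) = 1.75 bits
print(kl(p, q), cross_entropy(q, p) - entropy(p))    # both 0.25 bits
```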
Why do we care? As seen previously, we often want to compare two distributions in ML. For example, to evaluate how a classifier performed on a single instance, we can compute the cross-entropy between the ground-truth distribution of labels for that instance (a one-hot vector of 0s and 1s) and the output of the activation function (e.g. a softmax), which is a fuzzier distribution. We precisely want to quantify the distance between the network's output and the labels.
There is also a concept of *mutual information* between r.v.s $X,Y$: $I(X;Y) = D_{KL}(P_{X,Y} \| P_X \otimes P_Y)$, the additional cost of encoding samples from the joint distribution of $X,Y$ using a code that treats them as independent (i.e. a code optimal for the product of marginals $P_X \otimes P_Y$).
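A sketch of mutual information as a KL divergence, using a tiny made-up $2 \times 2$ joint distribution:

```python
import numpy as np

# P(X=x, Y=y) for x in {0, 1} (rows) and y in {0, 1} (columns); values made up.
joint = np.array([[0.3, 0.2],
                  [0.1, 0.4]])
px = joint.sum(axis=1, keepdims=True)   # marginal distribution of X (column)
py = joint.sum(axis=0, keepdims=True)   # marginal distribution of Y (row)
indep = px * py                         # product of marginals, P_X ⊗ P_Y

mi = np.sum(joint * np.log2(joint / indep))   # D_KL(P_{X,Y} || P_X ⊗ P_Y)
print(mi)   # ~0.12 bits: positive because X and Y are not independent here
```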
Summary: cross-entropy is a natural way to compare distributions. It measures differences in the appropriate, "multiplicative" units (the log of a ratio of probabilities): e.g. if one distribution gives p=1 and the other gives p=0.01, the difference in probability is huge even though the absolute difference is only 0.99, and it is penalised as $\log_2(100) \approx 6.6$ bits. If one distribution gave p=1 and the other gave p=0.001, this is penalised as $\log_2(1000) \approx 10$ bits. Note that this is quite different to something like integrating the difference between the two curves, which doesn't account for how much worse a p=0.0001 prediction is compared to a p=0.01 prediction when the label is 1.
Entropy of a probability distribution measures the length (or "cost", if we are paying per bit) of a message encoded in the optimal encoding for that distribution.
Cross-entropy measures the length (cost) of encoding a message from one distribution using the optimal encoding from another.
Kullback-Leibler divergence measures the *additional cost* of encoding a message from one distribution using the optimal encoding for another, compared to encoding it using its own distribution's optimal encoding.
Mutual information (MI) measures the information shared between two r.v.s by looking at the additional cost of encoding a message from their joint distribution using a code that assumes they are independent (i.e. that the joint distribution is the product of marginals), compared to using the optimal code for the true joint distribution.
Extra info: $h(x) = \log_2(1/p(x)) = -\log_2(p(x))$ is the *information* of event x, which we can think of as the "surprise" of x. If an unlikely event happens, this carries lots of information. So entropy is just the expected information of a distribution. "Expected surprise", or "expected uncertainty".
Notes on rejection sampling and importance sampling
Approach 1: inverse transform sampling
Approach 2: rejection sampling
Importance sampling
Effective sample size
Kish's ESS
Notes on covariance, contravariance, invariance
These are words used to describe data types (“elevateds”) composed of other data types (“constituents”).
- Covariance: the subtype relationship of constituents is conserved.
- Contravariance: the subtype relationship of constituents is reversed.
- Invariance: no subtype relationship conserved.
Covariance applies when the elevated only exposes instances of the constituent. e.g. when the elevated is a read-only list of constituents, or the elevated returns a constituent. e.g. a ConstList<Cat> is a subtype of ConstList<Animal> because anything that wants to consume an Animal is happy to receive Cats.
Contravariance applies when the elevated only consumes instances of the constituent. e.g. when the elevated is a function of the constituent. e.g. an Action<Animal> is a subtype of an Action<Cat> because anything that wants to send off a Cat is happy to send it to a service receiving Animals.
Invariance applies when the elevated consumes and exposes instances of the constituent. e.g. when the elevated is a mutable list of a constituent. e.g. List<Cat> is not a subtype of List<Animal> because the user might want to take the list and store a Dog in it, and List<Animal> is not a subtype of List<Cat> because the user expects the list to contain cats.
Then function types are contravariant in the inputs and covariant in the outputs. So Cat -> Animal has non-trivial subtypes Animal -> Cat, Cat -> Cat, Animal -> Animal.
Don’t try to think of this in terms of calls to the function itself. This is just a statement about the function type. When we say Animal -> Animal is a subtype of Cat -> Animal we don’t mean that we can substitute an Animal as the input when a Cat is asked for. We mean that we can substitute the function itself with an instance of the subtype. We are saying that in the expression Animal a = f(someCat), we can replace f with an Animal -> Animal (or indeed an Animal -> Cat or Cat -> Cat) and it’ll still work.
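Python's type system expresses the same rules: `Callable` is contravariant in its parameter types and covariant in its return type. A minimal sketch (the `Animal`/`Cat` classes and the function names are made up; a checker such as mypy accepts both calls):

```python
from typing import Callable

class Animal: ...
class Cat(Animal): ...

def apply_to_cat(handler: Callable[[Cat], Animal]) -> Animal:
    """Expects a function of type Cat -> Animal."""
    return handler(Cat())

def animal_to_animal(a: Animal) -> Animal:
    return a

def animal_to_cat(a: Animal) -> Cat:
    return Cat()

# Callable is contravariant in its parameters and covariant in its return type,
# so Animal -> Animal and Animal -> Cat are both subtypes of Cat -> Animal.
apply_to_cat(animal_to_animal)
apply_to_cat(animal_to_cat)
```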