A measure of the uncertainty of a distribution. What does this mean? When we sample from the distribution, how certain can we be of what we'll get?
Take a discrete probability distribution over outcomes A, B, C. Suppose it's 0.9, 0.05, 0.05. Then we are pretty certain that when we sample from it we'll get A. But if the distribution is uniform, 1/3, 1/3, 1/3, then we have very little certainty.
How to quantify this? Suppose we were encoding a message whose components were sampled from the distribution. We'd want to use short encodings for common components and longer encodings for rare components. The entropy of the distribution is the expected length of a component in the most efficient possible binary prefix-free encoding (strictly, it's the lower bound on this expected length, approachable by encoding long blocks of samples at once). Since "expected length" is a bit hard to conceptualise (since it involves two concepts, the probability distribution and the encoding), it might be easier to think of the entropy as the inverse concept to the *density* or the *compressibility* of a distribution. Low entropy means we can transmit lots of samples in few bits; high entropy means it takes lots of bits to transmit samples.
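To make the encoding picture concrete, here's a rough Python sketch (the helper name `huffman_lengths` is just illustrative) that builds a Huffman code, a standard near-optimal binary prefix-free code, for the two example distributions above. Because it must spend whole bits per outcome it lands a little above the ideal limit, but the trend is the point: the peaked distribution needs fewer bits per sample.

```python
import heapq

def huffman_lengths(probs):
    """Greedy Huffman construction: returns the code length (in bits)
    that a near-optimal binary prefix-free code assigns to each outcome."""
    # Heap entries: (subtree probability, tie-breaker, outcome indices in the subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, ids1 = heapq.heappop(heap)
        p2, _, ids2 = heapq.heappop(heap)
        for i in ids1 + ids2:  # merging two subtrees adds one bit to every outcome inside them
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, tie, ids1 + ids2))
        tie += 1
    return lengths

for probs in ([0.9, 0.05, 0.05], [1/3, 1/3, 1/3]):
    lengths = huffman_lengths(probs)
    avg = sum(p * l for p, l in zip(probs, lengths))
    print(probs, lengths, f"average length = {avg:.2f} bits")
# Peaked distribution: ~1.1 bits per sample; uniform: ~1.67 bits per sample.
# Less uncertainty, fewer bits. The entropy formula below gives the ideal limit.
```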
Indeed it can be shown that if a component is sampled with probability p, then its optimal length is $\log_2(1/p) = -\log_2(p)$. Let's check intuition:
- if a component is sampled w/ prob 1, then we don't encode it at all, because there's no uncertainty in the distribution. We know what the message will be without even receiving it.
- if a component is sampled w/ prob 0.5, it's pretty likely; we should probably spend $\log_2(2)$ = 1 bit on it.
- if a component is very unlikely, say probability 0.01, we are happy to spend a lot of bits on it because we'll almost never have to transmit it. The number of bits should grow as the probability shrinks, but not linearly in 1/p: spending 1/0.01 = 100 bits on it would be absurd, and would ignore the binary tree structure of prefix codes. We do want to spend $\log_2(1/0.01) \approx 7$ bits.
This is the role played by the negative log. Note that the negative log just comes from a log law applied to $\log_2(1/p)$, which is the more intuitive formula: it captures the inverse relationship between length and probability, as well as the $\log_2$ that comes from the binary tree structure of prefix codes. It transforms probabilities into suggested encoding lengths.
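A quick numerical check of these suggested lengths, matching the bullet points above:

```python
from math import log2

# Suggested code length -log2(p) for a few outcome probabilities.
for p in (1.0, 0.5, 0.01):
    print(f"p = {p:>5}: suggested length = {-log2(p):.2f} bits")
# p =   1.0: 0.00 bits, p =   0.5: 1.00 bits, p =  0.01: 6.64 bits (round up to ~7)
```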
To compute the average encoded length of a sample, we use the usual expected value formula over samples $x$ from the distribution, giving the formula for the entropy of $p$:
$$H(p) = -\sum_x p(x) \log_2 p(x)$$
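Applied to the two example distributions from the start, a minimal sketch (the helper name `entropy` is just illustrative):

```python
from math import log2

def entropy(probs):
    """H(p) = -sum_x p(x) log2 p(x), skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.9, 0.05, 0.05]))  # ~0.57 bits: we're pretty sure what we'll sample
print(entropy([1/3, 1/3, 1/3]))    # ~1.58 bits: maximal uncertainty over 3 outcomes
```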
Cross entropy, KL divergence, mutual information
$$H(p,q) = -\sum_x p(x) \log_2 q(x)$$
What could this correspond to? Suppose we have two probability distributions over the same space. We come up with an optimal encoding for one of them, but then we have to use that encoding for samples from the other. How many bits will we need on average? I.e. what we are doing is randomly sampling a component from distribution $p$ and determining its length under the encoding that is optimal for distribution $q$. If it costs us more to use longer encodings, then cross-entropy is the cost of encoding a message from $P$ under the optimal encoding scheme of $Q$. If we encode $P$ under its own optimal scheme, there is no additional cost, so $H(p,p) = H(p)$. N.B. Cross-entropy is not symmetric!
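A small sketch of this, using the same toy distributions discussed just below (`cross_entropy` is an illustrative helper, not a library function):

```python
from math import log2

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x): average bits to encode samples drawn
    from p using the code that is optimal for q."""
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.5, 0.25]
print(cross_entropy(p, p))  # 1.5 bits: encoding p with its own code, i.e. H(p)
print(cross_entropy(p, q))  # 1.75 bits: the mismatched code costs extra
print(cross_entropy(q, p))  # also 1.75 here by coincidence; not symmetric in general
```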
Why do we care? Cross-entropy can tell us how different two distributions are. If $p, q$ are very different, then we expect $H(p,q)$ to be noticeably larger than $H(p)$. Note that $H(p)$ and $H(q)$ might be exactly the same (e.g. because they assign the same probability values, just to different outcomes: $p$ might assign 0.5, 0.25, 0.25 whereas $q$ assigns 0.25, 0.5, 0.25), but still we'll see that the cross-entropy $H(p,q)$ is higher. Indeed, the difference $H(p,q) - H(p)$ is the Kullback-Leibler divergence, expressing the distance of a probability distribution $q$ from a reference distribution $p$. KL can be interpreted as the *additional cost* of encoding messages from $P$ under the optimal encoding for $Q$, compared to encoding them under the optimal encoding for $P$.
$$D_{KL}(P \| Q) = H(P,Q) - H(P) = -\sum_x P(x) \log_2 Q(x) + \sum_x P(x)\log_2 P(x) = -\sum_x P(x) \log_2\frac{Q(x)}{P(x)}$$
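Checking the identity numerically with the same toy distributions (again, the helper name is just illustrative):

```python
from math import log2

p = [0.5, 0.25, 0.25]
q = [0.25, 0.5, 0.25]

def kl_divergence(p, q):
    """D_KL(p || q) = -sum_x p(x) log2(q(x)/p(x)): extra bits paid when samples
    from p are encoded with the code that is optimal for q."""
    return -sum(pi * log2(qi / pi) for pi, qi in zip(p, q) if pi > 0)

h_p  = -sum(pi * log2(pi) for pi in p)               # H(p)   = 1.5 bits
h_pq = -sum(pi * log2(qi) for pi, qi in zip(p, q))   # H(p,q) = 1.75 bits
print(kl_divergence(p, q), h_pq - h_p)               # 0.25 bits, computed both ways
print(kl_divergence(p, p))                           # 0.0: no extra cost under the right code
```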
Why do we care? As seen previously, we often want to compare two distributions in ML. For example, to evaluate how a multiclass classifier performed on a single instance, we can take the cross-entropy of the ground truth distribution of labels for that instance (a one-hot vector of 0s and 1s) with the output of the activation function, which is a fuzzier distribution. We precisely want to quantify the distance between the network's output and the labels.
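A hypothetical 3-class instance to illustrate (the numbers are made up; note that ML frameworks usually use the natural log rather than $\log_2$, which only changes the units):

```python
import numpy as np

y_true = np.array([0.0, 1.0, 0.0])   # one-hot ground truth: the instance is class 1
y_pred = np.array([0.1, 0.7, 0.2])   # the network's fuzzier output distribution

eps = 1e-12                          # guard against log(0) for hard zeros
loss = -np.sum(y_true * np.log2(y_pred + eps))
print(loss)                          # ~0.51 bits: just -log2 of the prob given to the true class
```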
There is also a concept of *mutual information* between r.v.s $X,Y$, where $I(X;Y) = D_{KL}(P_{X,Y} \| P_X \otimes P_Y)$: the additional cost of encoding samples of $(X,Y)$ as if they were independent, i.e. as if their joint distribution were the product of marginals, when in fact it may not be.
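A rough sketch with a made-up joint distribution of two binary variables:

```python
import numpy as np

# Hypothetical joint distribution of binary X, Y (rows = X, cols = Y).
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
px = joint.sum(axis=1)        # marginal of X
py = joint.sum(axis=0)        # marginal of Y
indep = np.outer(px, py)      # what the joint would be if X, Y were independent

# I(X;Y) = D_KL(joint || product of marginals), in bits.
mi = np.sum(joint * np.log2(joint / indep))
print(mi)                     # ~0.28 bits of shared information
```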
Summary: cross-entropy is a natural way to compare distributions. It measures differences in the appropriate, "multiplicative" units (logs of probability ratios, really): e.g. if one distribution gives p=1 and the other gives p=0.01, the difference in probability is huge, even if the absolute difference is only 0.99, and indeed this is penalised as $\log_2(100) \approx 6.6$ bits. If one distribution gave p=1 and the other gave p=0.001, it would be penalised as $\log_2(1000) \approx 10$ bits. Note that this is quite different from something like integrating the difference between the curves, which doesn't account for how much worse a p=0.0001 prediction is compared to a p=0.01 prediction when the label is 1.
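The penalties in question, as a one-liner:

```python
from math import log2

# Penalty (in bits) when the true label has probability 1 but the model gives it p.
for p in (0.01, 0.001, 0.0001):
    print(p, round(-log2(p), 2))   # ~6.64, ~9.97, ~13.29: each factor of 10 adds ~3.3 bits
```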
Entropy of a probability distribution measures the length (or "cost", if we are paying per bit) of a message encoded in the optimal encoding for that distribution.
Cross-entropy measures the length (cost) of encoding a message from one distribution using the optimal encoding from another.
Kullback-Leibler divergence measures the *additional cost* of encoding a message from one distribution using the optimal encoding from another, compared to encoding it using its own distribution's optimal encoding.
Mutual information (MI) measures the information shared between two r.v.s by looking at the additional cost of encoding a message from their joint distribution as if they were independent (i.e. using the optimal encoding for the product of marginals), compared to encoding it using the optimal encoding for the true joint distribution.
Extra info: $h(x) = \log_2(1/p(x)) = -\log_2(p(x))$ is the *information* of event x, which we can think of as the "surprise" of x. If an unlikely event happens, this carries lots of information. So entropy is just the expected information of a distribution. "Expected surprise", or "expected uncertainty".
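So, in code terms, entropy is just the expectation of the surprisal (a minimal sketch, reusing the first example distribution):

```python
from math import log2

def surprisal(p):
    """h(x) = -log2 p(x): the 'surprise' (information) of an outcome with probability p."""
    return -log2(p)

probs = [0.9, 0.05, 0.05]
expected_surprise = sum(p * surprisal(p) for p in probs)
print(expected_surprise)   # ~0.57 bits, the same entropy as computed earlier
```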