Algorithmic Gymnastics: Don't bound random integers with %

This will be a brief remark that will be the first in a series of posts about random number generation. Rather than discussing generation processes themselves (I'm no where near qualified enough to do that), I'll cover some notions of randomness, tests for randomness, and the role of those tests in software engineering.

You want to generate a random integer between 0 and $n$ inclusive. The usual advice is to do this:

num = random() % (n+1)

Where random() is some standard library function that outputs uniformly distributed random integers. This works. For most purposes, it works well. For cases where a uniform distribution is required, it's dangerous.

The problem is that random() outputs an integer less than some specific maximum. This maximum varies based on your programming language, so we'll refer to it as RAND_MAX – the name of the macro in C. On $k$-bit machines, RAND_MAX is often $2^{k-1} - 1$.

Assume for a second that RAND_MAX = 7, and you want a random number between 0 and 2 inclusive. You do num = random() % 3. Which outcomes of random() give num = 0? 0, 3, 6. What about num = 1? 1, 4, 7. What about num = 2? 2 and 5. Do you see the problem? Recall that the outputs of random() are uniformly randomly distributed, so the probability of num = 0 is $\frac37$, the probability that num = 1 is $\frac37$, but the probability that num = 2 is only $\frac27$. num is thus not uniformly distributed.

Obviously this effect is far less noticeable for larger RAND_MAX values, but it's present nonetheless. In cases where precision is a high priority, this bias is concerning.

So what's the right way? We can do:

do { num = random() } while (num > n)

but the expected number of iterations of this loop grows in inverse proportion to the size of $n$. Better to create a value $n'$ that's the lowest power of 2 not less than $n$, and instead do:

do { num = random() % n' } while (num > n)

This solution takes advantage of the fact that RAND_MAX is one less than a power of 2, so num % n' is uniformly distributed. For instance, in our previous example $n' = 4$, so num = random() % n' gives num = 0 when random() is 0 or 4, num = 1 when random() is 1 or 5, num = 2 when random() is 2 or 6, and num = 3 when random() is 3 or 7. This means that each possible outcome of num occurs with equal probability! However, if num = 3, we have to iterate again. Fortunately $n' < 2n$ so the expected number of iterations is less than 2 – a fair compromise for true uniform randomness.

A final word: I assume that your implementation of random() yields uniformly random integers. Standard library implementations of random() often do not satisfy this. To ensure uniform randomness, use a good random number generator.

In a future post, I'll discuss how to test the conformity of a black-box random number generator to the probability distribution it claims to choose from.

Algorithmic Gymnastics

Sunday, July 27, 2014

Don't bound random integers with %

No comments:

Post a Comment