Sunday, May 5, 2013

N Coloured Blobs

N coloured blobs wander around in groups on an open plain. Each group has its own distinct colour, and all coloured blobs in a group share the same colour. Occasionally, two groups will bump into each other: each of the blobs in the smaller group will take on the colour of the larger group, and the two groups will merge. If two groups of equal size bump into each other, the group that changes colour (the 'mergee') is chosen at random.

A researcher monitors a coloured blob population. Initially, each blob is on its own (i.e. in a group of size 1) and has its own distinct colour. After a sufficient amount of time has passed, all the blobs have become part of one huge group of the same colour. Let K be the total number of colour changes that occurred, summed over all blobs.

A lower bound on K is N-1: if every blob is merged individually into the growing huge group, each blob changes colour exactly once, from its initial colour to the colour of the huge group, giving N-1 changes in total. Determine an upper bound on K that is as tight as possible.

This is a sweet little problem with a not terribly difficult solution, but it involves some interesting complexity analysis and allows us to derive a beautiful new implementation for the disjoint-set data structure.

Let's follow a single blob through its lifecycle and count the number of times it changes colour. The worst case is that at every merge, its blob-set meets a blob-set of the same size and takes on the other blob-set's colour.

After the first merge, its blob-set is size 2. After the second merge, its blob-set is size 4. By induction, after the $k$th merge its blob-set has size $2^k$; so if there were $N$ blobs initially, a single blob can experience at most $\log_2 N$ merges and thus at most $\log_2 N$ colour changes. For our final upper bound, we assume that every blob experiences this many colour changes, meaning that $O(N \log N)$ colour changes occur overall. (In fact, the upper bound can be tightened to $\frac{N \log_2 N}{2}$, but being a computer scientist, I'm not concerned about constant factors. seanieg89 from the boredofstudies.org forum proved this tighter bound with some fancy maths, and that proof is available here.)
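As a sanity check on the doubling argument, here's a quick simulation sketch in Python (the function and variable names are mine): it repeatedly merges randomly chosen groups, always recolouring the smaller one, and compares the observed K against $N \log_2 N$. Random merge orders won't achieve the worst case, so the observed K will sit well under the bound, but it should never exceed it.

```python
import math
import random

def simulate(n):
    """Randomly merge n singleton groups into one; count every colour change."""
    groups = [[i] for i in range(n)]  # each inner list is one group of blob ids
    changes = 0
    while len(groups) > 1:
        i, j = random.sample(range(len(groups)), 2)
        if len(groups[i]) < len(groups[j]):
            i, j = j, i               # i is now the (weakly) larger group
        changes += len(groups[j])     # every blob in the smaller group recolours
        groups[i].extend(groups[j])
        groups.pop(j)
    return changes

if __name__ == "__main__":
    for n in (10, 100, 1000):
        worst = max(simulate(n) for _ in range(50))
        print(f"N={n}: worst observed K={worst}, N*log2(N)={n * math.log2(n):.0f}")
```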

This demonstrates an interesting implementation of the disjoint-set data structure that I hadn't ever considered before. Give each set a label and keep track of each set's size. When it comes time to merge two sets, relabel every element of the smaller set with the larger set's label. By the doubling argument above, the total relabelling work across all merges is worst-case $O(n \log n)$, as opposed to the classic union-find implementation, which is marginally faster: $O(n \cdot \alpha(n))$, where $\alpha(n)$ is the inverse Ackermann function. However, in many of the places where we'd normally use disjoint sets, such as Kruskal's algorithm, we have to sort the input first, so the overall complexity is $O(n \log n)$ regardless.
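Here's a minimal sketch of that scheme in Python (the class and method names are my own, not from the problem). Each element stores its set's label directly, so `find` is a single array lookup; `union` relabels the smaller set, which is exactly the blobs-changing-colour step.

```python
class BlobSets:
    """Disjoint sets via blob-merging: every element stores its set's label
    directly, and a union relabels each member of the smaller set."""

    def __init__(self, n):
        self.label = list(range(n))                 # label[x] = set containing x
        self.members = {i: [i] for i in range(n)}   # label -> elements in the set

    def find(self, x):
        return self.label[x]  # O(1): no pointer chasing, no path compression

    def union(self, x, y):
        a, b = self.label[x], self.label[y]
        if a == b:
            return False                  # already in the same set
        if len(self.members[a]) < len(self.members[b]):
            a, b = b, a                   # a is now the larger set
        for z in self.members[b]:         # relabel the smaller set: this is
            self.label[z] = a             # the blobs changing colour
        self.members[a].extend(self.members[b])
        del self.members[b]
        return True
```

Note the trade-off relative to union-find: `find` becomes a guaranteed $O(1)$ array lookup, while the cost moves into `union`, where each element is relabelled at most $\log_2 n$ times over the structure's lifetime.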

So our blob-merging operation, which has virtually no overhead and is blisteringly fast, can actually replace the standard union-find algorithm in many situations! I've seen plenty of olympiad problems that require fast disjoint sets but have additional complications under which union-find doesn't work very nicely. In these situations, it may be preferable to merge blobs as above.
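For instance (a hypothetical complication of my own invention, not from any particular problem): suppose each blob carries a value, and after every merge we must report how many distinct values the combined group contains. Path-compressed union-find offers no natural place to keep per-set data, but blob-merging handles it directly: pour the smaller group's values into the larger group's.

```python
def merge_groups(group_values, a, b):
    """Merge the value-sets of groups a and b small-to-large.

    group_values: dict mapping group id -> set of values in that group.
    Returns the id of the surviving group.
    """
    if len(group_values[a]) < len(group_values[b]):
        a, b = b, a                     # a is now the larger group
    group_values[a] |= group_values[b]  # O(size of the smaller set)
    del group_values[b]
    return a

groups = {0: {"red"}, 1: {"blue"}, 2: {"red"}, 3: {"green"}}
g = merge_groups(groups, 0, 1)
g = merge_groups(groups, g, 2)
print(len(groups[g]))  # 2 distinct values among blobs 0, 1, 2
```

The same $O(n \log n)$ analysis applies: each value moves only when its group is the smaller one, so it moves at most $\log_2 n$ times.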
