Saturday, December 24, 2011

Quick 'n Dirty Disjoint Sets

The disjoint-set data structure is magical. Not only is it flexible and indispensable in a variety of situations, but both the associated algorithms and the implementation are remarkably clean and simple. This makes the disjoint-set data structure (sometimes called the 'union-find data structure', named after its two primary operations) a joy to encounter when programming.

The data structure is used to answer the question "given a set of bidirectional connections between nodes, can I reach node b from node a by walking along these connections?" It can be visualised as a set of disconnected trees where each tree regularly bumps deep nodes up to the top, where they become children of the root node. Each tree represents a set of connected nodes, so to determine whether we can walk between two nodes -- that is, whether the two nodes are part of the same set -- all we have to do is check if they have the same root. It's marginally more complicated than that, but that's the general gist of union-find. The data structure also allows quick merging of trees, meaning you can add new connections in near-constant time.

Being a tree data structure, it's relatively easy to implement using a pointer structure, where each node points to its parent. Here's a C implementation:

#include <stdlib.h>

struct set {
    int rank;
    struct set *parent;
};

struct set* newSet(void) {
    struct set *s = (struct set*) malloc(sizeof(struct set));
    s->parent = s;
    s->rank = 0;
    return s;
}

struct set* find(struct set *s) {
    if (s->parent == s)  return s;
    else                 return (s->parent = find(s->parent));
}

void join(struct set *a, struct set *b) {
    struct set *aRep = find(a), *bRep = find(b);
    if (aRep->rank > bRep->rank) {
        bRep->parent = aRep;
    } else if (aRep->rank < bRep->rank) {
        aRep->parent = bRep;
    } else {
        aRep->parent = bRep;
        bRep->rank ++;
    }
}

It's short, sweet and understandable (I'm glaring at you, binary indexed trees): exactly the kind of thing you can code up in a programming competition. The rank property is used to implement union by rank, optimising the data structure's performance. Combined with path compression, this yields an amortised time complexity of O(α(n)) per operation, where α is the inverse Ackermann function -- its value is below 5 for all practical values of n, so operations are effectively constant time.

But hey, that's still pretty long, and it needs, like, pointers and stuff. Why don't we drop union by rank? Although we lose the inverse Ackermann upper bound, path compression by itself keeps operations almost as fast... and who could resist the reduction in code size?

rep = range(0,100)

def find(p):
    if rep[p] != p:
        rep[p] = find(rep[p])
    return rep[p]
 
def union(p,q):
    rep[find(p)] = find(q)

This is almost the same thing, but with less code and more elegance (in my opinion, at least). I've been told that at least one university teaches union-find without union by rank, so it's also an acceptable implementation.

Here, I'm generating a set of random points on the plane and building a Euclidean minimal spanning tree (MST) from them, using Kruskal's algorithm and our very own disjoint-set data structure implementation. This code has not been tested for correctness.

import random, itertools

rep = range(0,100)

def find(p):
    if rep[p] != p:
        rep[p] = find(rep[p])
    return rep[p]
 
def union(p,q):
    rep[find(p)] = find(q)


# generate 100 random points on the plane
pts = [(random.randint(0,1000), random.randint(0,1000)) for i in xrange(100)]

# define Euclidean distance function
def dist(a,b):
    return ((pts[a][0]-pts[b][0])**2 + (pts[a][1]-pts[b][1])**2)**0.5

# generate pairs of points representing the edges of the complete graph
edges = list(itertools.combinations(range(100), 2))

# sort the edges by length, in ascending order
edges.sort(key=lambda (u,v): dist(u,v))

# Kruskal's algorithm for MST
total_cost = 0
for (u,v) in edges:
    if find(u) != find(v):
        total_cost += dist(u,v)
        union(u,v)

print total_cost

It might not be as theoretically sound as the pointer implementation, especially considering that the max number of nodes is also constrained by the size of the rep array, but it's quick and dirty, and sometimes that's just what you need.

Tuesday, October 11, 2011

ASCII Circles

While browsing Reddit, I came across this page. It's a fun little task. Here's my solution, with lots of fiddly maths to make the circle look as clean as possible.

import sys
r = int(sys.argv[1])

for y in xrange(int(-r*1.3), int(r*1.3), 2):
    for x in xrange(int(-r*1.3), int(r*1.3)):
        sys.stdout.write('#$@%&0*;:,. '[min(abs(r*r - (x*x+y*y))/(r/2), 11)])
    print

Circle of r = 10:

        ,;*0&&&0*;,       
     ,*%$#@%%%%%@#$%*,    
   ,*@#%0;:,...,:;0%#@*,  
  ,0$@0;.         .;0@$0, 
 .*$@0:             :0@$*.
 ,&#%;.             .;%#&,
 ,&#%;.             .;%#&,
 .*$@0:             :0@$*.
  ,0$@0;.         .;0@$0, 
   ,*@#%0;:,...,:;0%#@*,  
     ,*%$#@%%%%%@#$%*,    
        ,;*0&&&0*;,      

r = 30:

                             .,:;*00&&&&&&&00*;:,.                            
                         ,;0&@$##$@@%%%%%%%@@$##$@&0;,                        
                     .;0%$#$%&*;:,,..     ..,,:;*&%$#$%0;.                    
                   :0%#$%0;:.                     .:;0%$#%0:                  
                .;&$#%0;,                             ,;0%#$&;.               
               ;&$$%*,                                   ,*%$$&;              
             ,0@#%*,                                       ,*%#@0,            
            ;&#@0:                                           :0@#&;           
           ;%#%*,                                             ,*%#%;          
          ;%#%;.                                               .;%#%;         
         :&#%*.                                                 .*%#&:        
        ,0$@*,                                                   ,*@$0,       
        ;%#&:                                                     :&#%;       
       ,0$@*,                                                     ,*@$0,      
       :&#%;.                                                     .;%#&:      
       :&#%;                                                       ;%#&:      
       :&#%;                                                       ;%#&:      
       :&#%;.                                                     .;%#&:      
       ,0$@*,                                                     ,*@$0,      
        ;%#&:                                                     :&#%;       
        ,0$@*,                                                   ,*@$0,       
         :&#%*.                                                 .*%#&:        
          ;%#%;.                                               .;%#%;         
           ;%#%*,                                             ,*%#%;          
            ;&#@0:                                           :0@#&;           
             ,0@#%*,                                       ,*%#@0,            
               ;&$$%*,                                   ,*%$$&;              
                .;&$#%0;,                             ,;0%#$&;.               
                   :0%#$%0;:.                     .:;0%$#%0:                  
                     .;0%$#$%&*;:,,..     ..,,:;*&%$#$%0;.                    
                         ,;0&@$##$@@%%%%%%%@@$##$@&0;,                        
                             .,:;*00&&&&&&&00*;:,.                            
                                                                  

Here's the explanation I posted on Reddit (sorry about the formatting; hopefully it's still clear).

I guess I should explain what this does. Like the challenge poster, I use the Pythagorean theorem. I look through each coordinate in a square of size 2.6r * 2.6r around the centre-point. The circle only really exists in the 2r * 2r square around the point, but the extra padding on all sides allows for the anti-aliasing to continue further along the edges.

Mathematically, a circle consists of the coordinates (x, y) that satisfy x^2 + y^2 = r^2, r being the radius of the circle. Any points we encounter that satisfy this should be "coloured in" as dark as possible. Some points do not fall exactly on the circle, but they fall close to it: the (absolute) difference between r^2 and x^2 + y^2 isn't quite the distance between (x, y) and the closest point on the circle, but near the circle it's roughly proportional to that distance (about 2r times it). If the difference is 0, we have the case above. Otherwise, the larger the value, the less the point is part of the circle and so the lighter we should 'colour' it.

To colour points, I index the 'colouring' string by a scaled value of this closeness, scaled so the width of the ring remains the same for different values of r.

Thursday, October 6, 2011

Space-efficient Memoization with dicts

Python's not usually considered a good language for memory- or time-intensive tasks, but its extensive standard library and variety of built-ins make it an invaluable tool for algorithmic work.

Consider the task of memoizing a recursive function. Typically, a large d-dimensional array is used as a cache, where d is the number of parameters in the subproblem's state. This is great for bottom-up iterative dynamic programming, where the optimal solution for every state must be evaluated -- every cell of the array will be occupied, so there's no redundant space. However, many problems do not require that every state be examined. Consider a case of the unbounded knapsack problem where you have a knapsack of size 15 and 3 items available, of costs 5, 7 and 11. The bottom-up approach will evaluate the optimal solution for every capacity from 1 to 15, including capacities such as 2, 6, 9 and 13 which cannot contribute at all to the final optimal selection of items; after all, there's no way to reach those capacities by removing items of cost 5, 7 or 11 from a knapsack of size 15. The top-down memoized recursion approach, on the other hand, will only evaluate states which can theoretically contribute to the optimal solution (even if they do not).

Despite the overhead created by recursive calls and a much higher proportion of cache misses, the top-down memoized approach is often a lot faster than the bottom-up dynamic programming solution. However, people will still generally use the d dimensional array for caching, even when memoizing. This leads to lots of unused cells and wasted space.

Python gives you another option: instead of an array, use a dictionary of {d-tuple representing state : optimal solution for state}. Dictionaries have extremely fast look-up times, and the amount of memory saved is substantial.
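
As a quick illustration, here's what a dict-backed cache might look like for the knapsack example above (a sketch in the same Python 2 style as the rest of this post -- the item values and the best/cache names are made up for demonstration):

costs, values = [5, 7, 11], [10, 13, 25]   # hypothetical item costs and values
CAPACITY = 15

cache = {}
def best(c):
    # best total value achievable with remaining capacity c
    if c not in cache:
        options = [best(c - cost) + value for (cost, value) in zip(costs, values) if cost <= c]
        cache[c] = max(options) if options else 0
    return cache[c]

print best(CAPACITY), len(cache)   # only the reachable capacities end up in the cache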

Consider the 'matrix sum' problem (as seen in Project Euler #345), a variant of the classic n-rooks problem. You're given an n x n grid (chessboard?) where each cell has a value associated with it. Your task is to place n rooks on the board such that none of them threaten another (no two rooks lie on the same row or column) and the sum of values in the cells they occupy is maximised.

This is an exponential dynamic programming problem, where the state of a subproblem has two parameters: the current row, and a binary string representing the columns that have already been occupied and cannot have rooks placed in them from now on. I don't want to elaborate too much on the solution because it's an interesting problem that's worth your attention. There are n possible values for the current row and 2^n possible values for the binary string, so in total there should be n * 2^n states. Let's work with n = 15. The total theoretical number of states should be 491,520, which would be the number of cells in the d-dimensional array.

However, in practice there are many states that are impossible to reach. For example, there is no state where you're on the very first row and yet all columns are already occupied. Similarly, there's no state where you're on the last row and no columns are occupied. It's possible to mathematically calculate the exact number of reachable states (I did it once, it's not very fun) but since this is a programming blog, I'd rather just demonstrate a solution that tells you exactly how many states must be cached.

N = 15
m = [map(int, line.split()) for line in open('matrix.txt')]

cache = {}
def dp(r, u):
    # r = current row; u = bitmask of columns already occupied by a rook
    if (r,u) not in cache:
        cache[(r,u)] = 0 if r==N else max(dp(r+1, u|(1<<c)) + m[r][c] for c in xrange(N) if not u&(1<<c))
    return cache[(r, u)]

print dp(0,0), len(cache)

Output:

32768

That's right ladies and gentlemen, only 32768 states are reachable. That's 1/15th of the original memory usage. Obviously you could do the same thing in another language, but Python makes it so easy. Have fun implementing your own hash table in C :)

Monday, September 5, 2011

A Practical Use for the 'complex' Type

In a recent programming competition, a question asked competitors to generate certain fractals (Julia sets, to be precise). This task requires understanding of the mathematics behind fractals; specifically, complex numbers. The judges assumed that most competitors would be unfamiliar with complex numbers and how to manipulate them programmatically. Hence, about half of the problem statement documented the different operations that could be done with complex numbers: polar/Cartesian conversions, getting the modulus and the angle, how the arithmetic operators worked, etc.

Luckily, I recalled that Python has built-in support for complex numbers. All I had to do was plug in the given formulae and the problem basically solved itself.

In celebration, here's a short piece of Python code which generates a 200x200 Mandelbrot set in ASCII, using Python's complex number support and the cmath library. cmath is to complex numbers as math is to real numbers: it provides loads of helpful functions you might need when dealing with complex numbers. Here, I use cmath to get the modulus of z.

import math, cmath
RX, RY = (-2.5,1), (-1,1)

# maps a pixel (x,y) into a (x,y) point on the Mandelbrot canvas
def lmap(n, (a,b)): return n/200.0 * (b-a) + a

def value(x,y):
    z, c = 0, complex(lmap(x, RX), lmap(y, RY))
    iters = 0
    while cmath.polar(z)[0] < 4 and iters < 1000:
        z = z*z + c
        iters += 1
    return int(math.log(iters)/math.log(1000) * 10.0)
        
out = open('out.txt', 'w')
for r in xrange(200):
    for c in xrange(200):
        out.write(' .:;*0&%@$#'[value(c,r)])  # pass (column, row) so x maps onto RX
    out.write('\n')

A small portion of the generated fractal:

Wednesday, August 3, 2011

Anagram Searching -- A Job Interview Question

Occasionally a problem comes along that stuns me with its simplicity and elegance. This one in particular is from the 2006 International Olympiad in Informatics (IOI) under the name Writing. It's a great question to give in a job interview because a bruteforce solution is easy to work out, and some may stop there. However, with a bit more effort, it's possible to develop a solution that improves on the bruteforce by orders of magnitude.

A summary of the problem is as follows:

You're given a parent string and a query string of assorted characters of some alphabet (ASCII works nicely). You must determine how many times the query string -- or a permutation of the query string -- appears in the parent string.

Here is a typical thought-path one may take to the optimal solution. In order to evaluate the efficiency of each algorithm, we'll call the parent string's length n, the query string's length k and the alphabet size s.

1. Bruteforce


Generate all permutations of the query string, then do a naive sub-string count on the parent string for each permutation.

import itertools
print sum(parent.count(''.join(perm)) for perm in set(itertools.permutations(query)))

Complexity: O(k!) for generating the permutations, then O(nk) for a naive substring counting algorithm, giving O(nk * k!).


2. "Crossing off" method


There are lots of common string search/manipulation problems in circulation, and most involve strings that must be treated as ordered sequences; that is, you can't mess around with the order of the characters.

Instead of treating this problem as an extension of those problems with an extra restriction, take advantage of the unordered property.
Here's a new way of looking at the problem: for every consecutive set of k characters in the parent string, is that set of k characters the same as the set of characters in our query string?

We've now introduced a new problem: efficient comparison of character sets.

Let our first string be "goldy" and our second be "dogle". A fairly obvious way of checking that the characters are the same is a "crossing-off" method. For each character in "goldy", look for it in "dogle"; if you find an occurrence that hasn't already been crossed off on a previous pass, cross it off (this check matters -- "aab" should not be considered the same as "abb"). If some character can't be matched to an uncrossed occurrence, terminate early: the strings differ. If at the end of the process all characters have been crossed off and early termination didn't occur, your strings match.
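
Here's what that comparison might look like in Python (a sketch to match the rest of this post -- sameChars is just an illustrative helper, and parent and query are assumed to be defined as before):

def sameChars(a, b):
    crossed = [False] * len(b)
    for c in a:
        for i in xrange(len(b)):
            if b[i] == c and not crossed[i]:
                crossed[i] = True       # found an uncrossed partner for c
                break
        else:
            return False                # no uncrossed occurrence of c left in b
    return len(a) == len(b)             # everything crossed off, lengths agree

print sum(sameChars(parent[i:i+len(query)], query)
          for i in xrange(len(parent) - len(query) + 1))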

Complexity: For every consecutive set of k characters in the parent (O(n) sets), for every character in that set, check if it's in the query string and if so, cross it off. This algorithm has a time complexity of O(nk^2), which is already a massive improvement over the brute-force above.

3. Sort + compare


We can improve on the cross-off method. A better way to compare two sets is to sort both sets then perform an element-by-element comparison. "goldy" becomes "dgloy", "dogle" becomes "deglo", and a standard string comparison shows that the strings are unequal. Since you're probably in a language with a linearithmic sort in its standard library, this solution is easy to implement and works well.

sortedQuery = sorted(query)
print sum(sorted(parent[i:i+len(query)]) == sortedQuery for i in xrange(len(parent) - len(query) + 1))    

Complexity: You're still going through every consecutive block of k characters in the parent string, but this time you just have to compare the sorted block to the sorted query string. O(nk lg k).

4. Constant-time lookup


It turns out that sort-based set comparison is suboptimal. We can take advantage of the array structure's constant-time random access to create a lookup table of the form lookup[char] = number of occurrences of 'char' in query string, for every value of char in our alphabet. We first generate the lookup table in O(k) time and store it. As usual we'll iterate through every consecutive block of k characters in the parent string, but this time instead of sorting, we populate a second lookup table with the data from this block. To compare the sets all we have to do is compare the lookup tables.

Complexity: For each block, generate lookup tables (O(k)) then compare (O(s)). O(n(s + k)).

There's a slight optimisation to be made. If s is large and k is small, our table comparison will end up going through lots of characters which never appear in the query. To prevent this, we can restrict the comparison to only the characters that appear in the query string: since both the block and the query contain exactly k characters, if all of those counts match, the block can't contain any other characters.

import string

numMatches = 0

# allows for the O(nk) optimisation
queryChars = set(query)

correctTable = [0] * 127
for c in query:
    correctTable[ord(c)] += 1

testTable = [0] * 127

for i in xrange(len(parent)-len(query) + 1):
    thisBlock = parent[i:i+len(query)]
    
    for c in thisBlock:
        testTable[ord(c)] += 1

    for c in queryChars:
        if correctTable[ord(c)] != testTable[ord(c)]:
            break
    else:
        numMatches += 1
        
    for c in thisBlock:
        testTable[ord(c)] -= 1

print numMatches

Complexity: Generate the second lookup table in O(k), compare them in O(k) too. O(nk).

5. Magic


The dominating mind-frame when one thinks about this problem is that it's just a twist on the well-known problem of set comparison: the solution has to loop over each consecutive k-block in the parent string, and my job is to make the comparison between the block and the query string as fast as possible. But as long as each block is compared against the query from scratch, you're stuck doing at least O(k)-ish work per block, so the whole thing can't drop much below O(nk).

But this approach always leaves you with two loops, even though consecutive blocks overlap in all but one character, so the inner loop keeps re-examining content the outer loop has already seen. The optimal solution is obtained by eliminating this redundancy in the comparison loop by integrating it with the outer loop.

We have to have an O(n) somewhere, if only for inputting the parent string. Our previous solution was doing two O(k) things on top of that: generating the second lookup table, then comparing the two tables. Say we generate a lookup table for the first block of the parent string, finish doing all our processing for that block, then move on to our next block. Our previous algorithm will generate a brand new lookup table now, even though we've just shifted over by a single character. The only rows that can look different between the two tables are the row of the first character of the previous block (which is now not covered by our new block) and the row of the last character in our new block (which was not previously covered). All the other rows will look exactly the same each time. Now that we're only updating two entries each time, generating the table is now O(k) for the first iteration and O(1) for each of the n-k blocks following: O(n). We just stumbled upon an O(n * min(k,s)) solution (you can derive the complexity from the observations above), but let's go straight on.
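
To make the rolling update concrete, here's a rough Python sketch of just that part (not the final algorithm, and not code from the original solution -- parent and query are assumed to be defined):

blockTable = [0] * 127
for c in parent[:len(query)]:                       # O(k): build the table for the first block
    blockTable[ord(c)] += 1

for i in xrange(1, len(parent) - len(query) + 1):   # O(1) table work per subsequent block
    blockTable[ord(parent[i-1])] -= 1               # character the window left behind
    blockTable[ord(parent[i+len(query)-1])] += 1    # character the window picked up
    # ... compare blockTable against the query's table here ...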

Now to apply this logic to comparing the two lookup tables. Previously, it was easy to know whether or not a given block's table matches our query string's table: if all of the character counts were the same, they match. The goal is to only operate on the character our block 'left behind' and the character our block is 'picking up'.

Throughout the program, keep track of a distance (similar to the concept of Hamming distance). distance will contain an integer representing the number of unique characters whose count in our current block's table matches that character's count in the query string's table. This will become clearer in a moment. Consider the query string 'hello' and the parent string 'ohelol':

current block: 'ohelo'

                       h   e   l   o
query string table     1   1   2   1
current block table    1   1   1   2
matches?               1   1   0   0      (distance = 2)

Here, the distance is 2 because only two character counts -- that of the 'h' and that of the 'e' -- match. Now look at the next block:

current block: 'helol'

                       h   e   l   o
query string table     1   1   2   1
current block table    1   1   2   1
matches?               1   1   1   1      (distance = 4)

Now our distance is 4. If this distance is equal to the number of unique characters in the query string (which it is), our current block is a matching block. Now do what you were doing before, but instead of comparing the lookup tables from scratch, just update the distance whenever the count of the 'old' character (the one the block leaves behind) or the 'new' character (the one it picks up) starts or stops matching the query string's count, then compare the distance to the number of unique characters in the query string. Phew!

Complexity: O(n). Can I go to sleep now?

This is an incredibly irritating algorithm to code. Here's the core of a C++ implementation (fitting the 'horrendous style' paradigm of this blog) to keep you busy.

// assumes: query_table[] and block_table[] are zero-initialised count arrays, letter_index()
// maps a character to an index into them, temp is a char, parent is a char buffer, and
// query_length, parent_length, distance, expected_distance and output are declared (counters starting at 0)
for (int i=0 ; i<query_length ; i++) {
	scanf("%c", &temp);
	query_table[letter_index(temp)] ++;
}
scanf("\n%s", parent);

for (int i=0 ; i<=letter_index('z') ; i++) expected_distance += (query_table[i] > 0);

for (int i=0 ; i<parent_length ; i++) {
	if (i - query_length >= 0) {
		distance -= block_table[letter_index(parent[i-query_length])] == query_table[letter_index(parent[i-query_length])];
		distance += --block_table[letter_index(parent[i-query_length])] == query_table[letter_index(parent[i-query_length])];
	}
	distance -= block_table[letter_index(parent[i])] == query_table[letter_index(parent[i])];
	distance += ++block_table[letter_index(parent[i])] == query_table[letter_index(parent[i])];
	output += (distance == expected_distance);
}



Friday, June 17, 2011

A Peculiar Numbering System

Using the partial function application and composition from the previous post, we can represent the natural numbers in an interesting way.

def partial(f, *p):
    return lambda *q: f(*(p + q))

def compose(*fs):
    return lambda x: reduce(lambda a,b: b(a), (fs+(x,))[::-1])

def represent(n):
    return compose(*(partial(int.__add__, 1),) * n)(0)

assert represent(0) == 0
assert represent(1) == 1
assert represent(1337) == 1337

The nth natural number is the function λx.x+1 composed with itself n times, applied onto the number 0.

This is related to the concept of Church numerals.

Friday, March 4, 2011

Partial Function Application and Function Composition

This following function demonstrates one of my favourite concepts in functional programming -- that of partial function application:

def pfa(f, *p):
    return lambda *q: f(*(p + q))

Partial function application is where you fix particular parameters of a function, producing a new function with the fixed parameters substituted in. For example, say I had a function add(a,b) = a + b. I can use partial function application to 'fix' the first argument, a, to a particular number x, and this produces a new one-argument function add_x(b) = x + b.
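
As a tiny (made-up) illustration using the pfa function above:

def add(a, b):
    return a + b

add5 = pfa(add, 5)    # fix the first argument to 5
print add5(3)         # prints 8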

In practice, PFA is often a more compact and readable alternative to the ugly anonymous functions you give to higher-order functions such as map. For example, consider the following code extract which transforms [a,b,c,...] to [2^a, 2^b, 2^c, ...]:

map(lambda n: pow(2, n), list)

Now take a look at how it is done with PFA:

map(pfa(pow, 2), list)

PFA in this instance creates a new function by substituting the value '2' as the first argument of the pow function, transforming pow(b, e) = b^e into pow(e) = 2^e.

Haskell's PFA notation is even more elegant; because everything in Haskell is just chained evaluation of partial functions, there's no special or unique notation for it, so the following is valid code:

map (2^) list

That maps the partial function (2^) over list. Haskell also has lambdas, but they're nowhere near as short or understandable.

map (\x -> 2^x) list

The Python implementation above can also fix multiple arguments at once: pfa(f, a, b) turns a three-argument function f(a, b, c) into a one-argument function of c alone. Sometimes it may be preferable to fix the second or third argument while leaving the first argument variable -- Haskell deals with this elegantly as usual, but I haven't found a nice way to do it in Python.
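
For instance (another made-up example -- volume is just an illustrative function):

def volume(l, w, h):
    return l * w * h

box = pfa(volume, 2, 3)   # fix l=2 and w=3, leaving h variable
print box(4)              # prints 24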

note: apparently the functools module already has a partial() function for PFA. whoops :)

Function composition is another concept (again, originally mathematical) with similar roots. Here's an implementation:

def compose(*fs):
    return lambda x: reduce(lambda a,b: b(a), (fs+(x,))[::-1])


Essentially, this takes a chain of functions and nests them: in the two-function case, given f(x) and g(y) it produces a new function h(y) == f(g(y)). The new function is often notated as (f.g)(y), as in "f.g is the composition of f and g." Like partial function application, it is rarely a necessity to use function composition, but again it can make things more compact and more intuitive.

A common construct I use when dealing with strings is as follows:

new_s = old_s.strip().split()

This can be composed as follows:

new_s = compose(str.split, str.strip)(old_s)

A better example:

hash(bin(bool(id(sum(range(10))))))
# becomes
compose(hash, bin, bool, id, sum, range)(10)