Tuesday, January 3, 2012

Shortest Superstring Problem

If I give you a list of words, can you find the shortest string that contains all the words?

I can!

It turns out that this problem is equivalent to TSP, which means there are no polynomial-time algorithms to solve it. However, we can tackle it the same way as we tackle TSP: do the dynamic programming solution, which although slow, is the optimal correct algorithm. There's a trick to it though, which caught me out the first time: if one word is a complete subset of another and does not appear at the beginning or the end (as in the case ['abc', 'b']), the edge traversal principle doesn't work. In cases like these, the smaller word must first be removed from the list. You can see this below with ['germanic', 'german', 'germ']. EDIT: And if two words are the same, they'll both be substrings of each other, but we don't want to remove both of them. Thanks to Werner Lemberg for picking this up in the comments.

For once, some neat code. What's going on???

all_words = ['ginger', 'german', 'minutes', 'testing', 'tingling', 'minor', 'testicle', 'manage', 'guilt', 'germanic', 'normal', 'malt', 'german', 'germ']
all_N = len(all_words)

words = []

# remove strings which are substrings of others
for i in xrange(all_N):
    for j in xrange(all_N):
        if i != j and all_words[i] != all_words[j] and all_words[i] in all_words[j]:
            break
    else:
        words.append(all_words[i])

N = len(words)
        
 
# determine the numerical overlap between two strings a,b
def overlap(a, b):
    best = 0
    for i in xrange(1, min(len(a), len(b))+1):
        if b.startswith(a[-i:]):
            best = i
    return best
 
 
cost = [[None] * N for _ in xrange(N)]
# Precompute edge costs
# for every pair of words with indices u,v
for u in xrange(N):
    for v in xrange(N):
        # work out the best compressed concatenation you can make with u then v
        cost[u][v] = len(words[u]) + len(words[v]) - overlap(words[u], words[v])
 
                 
cache = {}
backtrace = {}
                 
# top-down DP
def solve(used, last):
    if (used,last) not in cache:
     
        bestCost = 0
        bestOption = None
 
        # the base case is when used == 0. using no words, the optimal solution is of length 0
        if used != 0:
         
            bestCost = float('inf')
         
            # for each word we can use
            for i in xrange(N):
                 
                # if the word is as yet unused
                if (1 << i) & used:
                 
                    # calc the cost of using it
                    newCost = cost[i][last] + solve(used & ~(1<<i), i)
                     
                    # if we've reached a new best solution, update stuff
                    if newCost < bestCost:
                        bestCost = newCost
                        bestOption = i
         
        # cache stuff
        cache[(used, last)] = bestCost
        backtrace[(used, last)] = bestOption
     
    return cache[(used, last)]
 
 
# run it for all possible starting cases
bestCost = float('inf')
bestOption = None
for i in xrange(N):
    cur = solve(((1<<N)-1) & ~(1<<i), i)
    if cur < bestCost:
        bestCost = cur
        bestOption = i
 
 
# reconstruct the words that were used
soln = []
used = (1<<N) - 1
last = bestOption
while last is not None:
    soln.append(words[last])
    used &= ~(1<<last)
    last = backtrace[(used, last)]
soln.reverse()
 
 
# now compress the words of the solution into the final string
cur = soln[0]
for i in xrange(1, N):
    cur += soln[i][overlap(cur, soln[i]):]
print cur

The output is minutestinglingingermanicminormaltmanageguiltesticle. Fascinating stuff, I know.