If I give you a list of words, can you find the shortest string that contains all the words?
I can!
It turns out that this is the shortest common superstring problem, which is NP-hard and can be cast as an (asymmetric) Travelling Salesman Problem, so no polynomial-time algorithm is known. However, we can tackle it the same way we tackle TSP: with the bitmask dynamic programming solution, which is exponential in the number of words but always finds an optimal answer. There's a trick to it though, which caught me out the first time: if one word is a substring of another and doesn't sit at the beginning or the end of it (as in ['abc', 'b']), the prefix/suffix overlaps that drive the edge costs never see it, so the smaller word must first be removed from the list. You can see this below with ['germanic', 'german', 'germ']. EDIT: And if two words are identical, each is a substring of the other, but we don't want to remove both of them. Thanks to Werner Lemberg for picking this up in the comments.
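Here's a quick sketch of that gotcha, using the same overlap helper that appears in the listing below. For ['abc', 'b'] the suffix/prefix overlap is zero in both directions, so the tour-based solution would happily produce 'abcb' or 'babc' (length 4), even though 'abc' (length 3) already contains both words:

def overlap(a, b):
    # length of the longest suffix of a that is also a prefix of b
    best = 0
    for i in xrange(1, min(len(a), len(b)) + 1):
        if b.startswith(a[-i:]):
            best = i
    return best

print overlap('abc', 'b'), overlap('b', 'abc')  # prints: 0 0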
For once, some neat code. What's going on???
all_words = ['ginger', 'german', 'minutes', 'testing', 'tingling', 'minor', 'testicle', 'manage', 'guilt', 'germanic', 'normal', 'malt', 'german', 'germ']
all_N = len(all_words)
words = []
# remove any word that is a proper substring of a different word (identical duplicates are kept)
for i in xrange(all_N):
    for j in xrange(all_N):
        if i != j and all_words[i] != all_words[j] and all_words[i] in all_words[j]:
            break
    else:
        words.append(all_words[i])
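# For the list above this keeps 11 of the 14 entries: 'germ' and both copies of
# 'german' are dropped because they are substrings of 'germanic'.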
N = len(words)
# length of the longest suffix of a that is also a prefix of b
def overlap(a, b):
    best = 0
    for i in xrange(1, min(len(a), len(b)) + 1):
        if b.startswith(a[-i:]):
            best = i
    return best
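# A few example values for the words above:
#   overlap('minutes', 'testing')  == 3   # shared 'tes'
#   overlap('testing', 'tingling') == 4   # shared 'ting'
#   overlap('testing', 'minutes')  == 0   # note that overlap is not symmetric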
cost = [[None] * N for _ in xrange(N)]
# Precompute edge costs for every ordered pair of words u,v:
# cost[u][v] is the number of extra characters v adds when it directly follows u
for u in xrange(N):
    for v in xrange(N):
        cost[u][v] = len(words[v]) - overlap(words[u], words[v])
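# e.g. putting 'testing' directly after 'minutes' costs 7 - 3 = 4 extra
# characters, because the pair merges into 'minutesting'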
cache = {}
backtrace = {}
# top-down DP over bitmasks:
# solve(used, last) returns the length of the shortest string made of every
# word whose bit is set in `used`, followed by words[last] at the very end
def solve(used, last):
    if (used, last) not in cache:
        # the base case is when used == 0: with no words left to place,
        # the optimal solution is just words[last] itself
        bestCost = len(words[last])
        bestOption = None
        if used != 0:
            bestCost = float('inf')
        # for each word we could place immediately before words[last]
        for i in xrange(N):
            # if word i still has to be placed
            if (1 << i) & used:
                # calc the cost of using it
                newCost = cost[i][last] + solve(used & ~(1 << i), i)
                # if we've reached a new best solution, update stuff
                if newCost < bestCost:
                    bestCost = newCost
                    bestOption = i
        # cache stuff
        cache[(used, last)] = bestCost
        backtrace[(used, last)] = bestOption
    return cache[(used, last)]
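# There are at most N * 2^N (used, last) states and each one tries N candidate
# predecessors, so this takes O(N^2 * 2^N) time and O(N * 2^N) memory.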
# try every word i as the final word of the superstring
bestCost = float('inf')
bestOption = None
for i in xrange(N):
    cur = solve(((1 << N) - 1) & ~(1 << i), i)
    if cur < bestCost:
        bestCost = cur
        bestOption = i
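# bestCost is now the length of the shortest superstring and bestOption the index of its final word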
# walk the backtrace to recover the chosen order of words (built back to front)
soln = []
used = (1 << N) - 1
last = bestOption
while last is not None:
    soln.append(words[last])
    used &= ~(1 << last)
    last = backtrace[(used, last)]
soln.reverse()
# now compress the words of the solution into the final string
cur = soln[0]
for i in xrange(1, N):
    cur += soln[i][overlap(cur, soln[i]):]
print cur
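If you want to convince yourself the result is valid, a couple of extra lines will confirm that every word from the original list (including the ones we filtered out) appears in it:

assert all(w in cur for w in all_words)
print len(cur)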
The output is a 52-character string: minutestinglingingermanicminormaltmanageguiltesticle (other orderings tie at the same length, so you may see a different string of the same size). Fascinating stuff, I know.
Very nice! However, there is a buglet in your code: If I have two identical entries, both of them are removed (since they are substrings of each other).
Mhmm, pressed the publish key too early.
For example, add `foo' two times to `all_words', and it doesn't appear in the output.
Good catch! As you can tell, I often get too excited to test properly. I've updated the entry accordingly. Thanks!
Hey mate, thanks for this, it's a nice algorithm.
I'm looking for something that will go to 50 strings and this one will only get to about 20 on my machine before running out of time and memory.
Thanks though.
I'm in NZ by the way.