Occasionally a problem comes along that stuns me with its simplicity and elegance. This one in particular is from the 2006 International Olympiad in Informatics (IOI) under the name Writing. It's a great question to give in a job interview because a brute-force solution is easy to work out, and some may stop there. However, with a bit more effort, it's possible to develop a solution that improves on the brute force by orders of magnitude.
A summary of the problem is as follows:
You're given a parent string and a query string, both made of characters from some alphabet (ASCII works nicely). You must determine how many times the query string -- or a permutation of the query string -- appears in the parent string. For example, with parent "abab" and query "ab", the answer is 3: the windows "ab", "ba" and "ab" are all permutations of the query.
Here is a typical thought-path one may take to the optimal solution. In order to evaluate the efficiency of each algorithm, we'll call the parent string's length n, the query string's length k and the alphabet size s.
1. Bruteforce
Generate all permutations of the query string, then do a naive sub-string count on the parent string for each permutation.
import itertools

# permutations() yields tuples, so join them back into strings. Wrapping in
# set() avoids double-counting when the query contains repeated characters,
# and counting windows (rather than using str.count, which skips overlapping
# matches) catches every occurrence.
perms = set(''.join(p) for p in itertools.permutations(query))
k = len(query)
print(sum(parent[i:i+k] == perm
          for perm in perms
          for i in range(len(parent) - k + 1)))
Complexity: O(k!) for generating the permutations, then O(nk) for a naive substring counting algorithm, giving O(nk * k!).
2. "Crossing off" method
There are lots of common string search/manipulation problems in circulation, and most involve strings that must be treated as ordered sequences; that is, you can't mess around with the order of the characters.
Instead of treating this problem as an extension of those problems with an extra restriction, take advantage of the unordered property.
Here's a new way of looking at the problem: for every consecutive set of k characters in the parent string, is that set of k characters the same as the set of characters in our query string?
We've now introduced a new problem: efficient comparison of character sets.
Let our first string be "goldy" and our second be "dogle". A fairly obvious way of checking that the characters are the same is a "crossing-off" method. For each character in "goldy", check if it's in "dogle"; if it is, cross it off. Make sure the character you find in the second string hasn't already been crossed off; "aab" should not be considered the same as "abb". If at the end of the process all characters have been crossed off and early termination didn't occur, your strings match.
Complexity: For every consecutive set of k characters in the parent (O(n) sets), for every character in that set, check if it's in the query string and if so, cross it off. This algorithm has a time complexity of O(nk^2), which is already a massive improvement over the brute force above.
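Sketched in Python (the helper names here are mine, not from the original problem):

```python
def sets_match(block, query):
    # Cross characters off a mutable copy of the query string; fail as soon
    # as a block character has no uncrossed partner left.
    remaining = list(query)
    for c in block:
        if c not in remaining:
            return False
        remaining.remove(c)  # cross it off
    return not remaining  # every query character was crossed off

def count_matches(parent, query):
    k = len(query)
    return sum(sets_match(parent[i:i+k], query)
               for i in range(len(parent) - k + 1))

print(count_matches("ohelol", "hello"))  # → 1 (the block 'helol' matches)
```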
3. Sort + compare
We can improve on the cross-off method. A better way to compare two sets is to sort both sets then perform an element-by-element comparison. "goldy" becomes "dgloy", "dogle" becomes "deglo", and a standard string comparison shows that the strings are unequal. Since you're probably in a language with a linearithmic sort in its standard library, this solution is easy to implement and works well.
sortedQuery = sorted(query)
print(sum(sorted(parent[i:i+len(query)]) == sortedQuery
          for i in range(len(parent) - len(query) + 1)))
Complexity: You're still going through every consecutive block of k characters in the parent string, but this time you just have to compare the sorted block to the sorted query string. O(nk lg k).
4. Constant-time lookup
It turns out that sort-based set comparison is suboptimal. We can take advantage of an array's constant-time random access to create a lookup table of the form lookup[char] = number of occurrences of char in the query string, for every value of char in our alphabet. We first generate this lookup table in O(k) time and store it. As usual we'll iterate through every consecutive block of k characters in the parent string, but this time, instead of sorting, we populate a second lookup table with the data from this block. To compare the sets, all we have to do is compare the lookup tables.
Complexity: For each block, generate lookup tables (O(k)) then compare (O(s)). O(n(s + k)).
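A minimal sketch of this full-table version, assuming an ASCII alphabet (so s = 128); the function name is my own:

```python
ALPHABET_SIZE = 128  # assume ASCII

def count_matches(parent, query):
    k = len(query)
    # lookup[char code] = number of occurrences of that character in the query
    query_table = [0] * ALPHABET_SIZE
    for c in query:
        query_table[ord(c)] += 1
    matches = 0
    for i in range(len(parent) - k + 1):
        block_table = [0] * ALPHABET_SIZE
        for c in parent[i:i+k]:                # O(k): tally this block
            block_table[ord(c)] += 1
        matches += block_table == query_table  # O(s): full-table comparison
    return matches

print(count_matches("ohelol", "hello"))  # → 1
```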
There's a slight optimisation to be made. If s is large and k is small, our table comparison will end up going through lots of characters which never appear in the query. To prevent this, we can store the characters we come across when constructing the second lookup table. When we compare the tables, we only have to compare the counts for these characters.
numMatches = 0
# restricting the comparison to the query's characters allows the O(nk) bound
queryChars = set(query)
correctTable = [0] * 128
for c in query:
    correctTable[ord(c)] += 1
testTable = [0] * 128
for i in range(len(parent) - len(query) + 1):
    thisBlock = parent[i:i+len(query)]
    for c in thisBlock:
        testTable[ord(c)] += 1
    for c in queryChars:
        if correctTable[ord(c)] != testTable[ord(c)]:
            break
    else:
        numMatches += 1
    for c in thisBlock:  # reset the block's counts before the next window
        testTable[ord(c)] -= 1
print(numMatches)
Complexity: Generate the second lookup table in O(k), compare them in O(k) too. O(nk).
5. Magic
The dominant mindset when one thinks about this problem is that it's just a twist on the well-known problem of set comparison: the solution has to loop over each consecutive k-block in the parent string, and my job is to make the comparison between the block and the query string as fast as possible. As long as every block is processed from scratch, that approach can't beat O(nk).
But this approach always leaves you with two nested loops, even though the inner loop re-reads exactly the characters the outer loop has already covered. The optimal solution is obtained by eliminating that redundancy: integrate the comparison work into the outer loop.
We have to have an O(n) somewhere, if only for reading the parent string. Our previous solution was doing two O(k) things on top of that: generating the second lookup table, then comparing the two tables. Say we generate a lookup table for the first block of the parent string, finish all our processing for that block, then move on to the next block. Our previous algorithm generates a brand new lookup table at this point, even though we've only shifted over by a single character. The only rows that can differ between the two tables are the row of the first character of the previous block (which is no longer covered by our new block) and the row of the last character in our new block (which was not previously covered). All the other rows are identical.

Since we're only updating two entries each time, generating the table is O(k) for the first iteration and O(1) for each of the n-k blocks that follow: O(n) overall. We've just stumbled upon an O(n * min(k, s)) solution (the comparison step still costs O(min(k, s)) per block), but let's go straight on.
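A sketch of that intermediate solution: the block's table is updated in O(1) per step, and the comparison touches only the query's distinct characters. This restriction is valid because both tables always sum to k, so a stray character in the block forces some query character's count to mismatch. Names are my own; ASCII assumed.

```python
ALPHABET_SIZE = 128  # assume ASCII

def count_matches(parent, query):
    k = len(query)
    query_table = [0] * ALPHABET_SIZE
    for c in query:
        query_table[ord(c)] += 1
    query_chars = set(query)
    block_table = [0] * ALPHABET_SIZE
    matches = 0
    for i, c in enumerate(parent):
        block_table[ord(c)] += 1                  # pick up the new character
        if i >= k:
            block_table[ord(parent[i - k])] -= 1  # drop the one left behind
        # O(min(k, s)) comparison over the query's distinct characters only
        if i >= k - 1 and all(block_table[ord(q)] == query_table[ord(q)]
                              for q in query_chars):
            matches += 1
    return matches

print(count_matches("ohelol", "hello"))  # → 1
```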
Now to apply this logic to comparing the two lookup tables. Previously, it was easy to know whether or not a given block's table matches our query string's table: if all of the character counts were the same, they match. The goal is to only operate on the character our block 'left behind' and the character our block is 'picking up'.
Throughout the program, keep track of a distance (similar in spirit to the Hamming distance). distance is an integer holding the number of unique characters in our current block's table whose counts match that character's count in the query string's table. This will become clearer in a moment. Consider the query string 'hello' and the parent string 'ohelol':
current block: 'ohelo'
|                     | h | e | l | o | distance |
|---------------------|---|---|---|---|----------|
| query string table  | 1 | 1 | 2 | 1 |          |
| current block table | 1 | 1 | 1 | 2 |          |
| matches?            | 1 | 1 | 0 | 0 | 2        |
Here, the distance is 2 because only two character counts -- that of the 'h' and that of the 'e' -- match. Now look at the next block:
current block: 'helol'
|                     | h | e | l | o | distance |
|---------------------|---|---|---|---|----------|
| query string table  | 1 | 1 | 2 | 1 |          |
| current block table | 1 | 1 | 2 | 1 |          |
| matches?            | 1 | 1 | 1 | 1 | 4        |
Now our distance is 4. If this distance is equal to the number of unique characters in the query string (which it is), our current block is a matching block. Now do what you were doing before, but instead of comparing the lookup tables, just update the distance as the counts of the 'new' block character and the 'old' block character change, then compare the distance to the number of unique characters in the query string. Phew!
Complexity: O(n). Can I go to sleep now?
This is an incredibly irritating algorithm to code. Here's a C++ implementation (fitting to the 'horrendous style' paradigm of this blog) to keep you busy.
// Assumed context: query_table[], block_table[] and the counters distance,
// expected_distance and output are zero-initialised globals; temp, parent,
// query_length and parent_length are declared elsewhere; letter_index()
// maps a character to a table slot.
for (int i=0 ; i<query_length ; i++) {
    scanf("%c", &temp);
    query_table[letter_index(temp)]++;
}
scanf("\n%s", parent);
// A block matches when distance reaches the number of distinct query characters.
for (int i=0 ; i<=letter_index('z') ; i++) expected_distance += (query_table[i] > 0);
for (int i=0 ; i<parent_length ; i++) {
    if (i - query_length >= 0) {
        // Character leaving the window: retract its old match flag,
        // decrement its count, then record whether it matches again.
        distance -= block_table[letter_index(parent[i-query_length])] == query_table[letter_index(parent[i-query_length])];
        distance += --block_table[letter_index(parent[i-query_length])] == query_table[letter_index(parent[i-query_length])];
    }
    // Character entering the window: the same update with an increment.
    distance -= block_table[letter_index(parent[i])] == query_table[letter_index(parent[i])];
    distance += ++block_table[letter_index(parent[i])] == query_table[letter_index(parent[i])];
    output += (distance == expected_distance);
}