This is an original string manipulation problem that I wrote. It was included on the 2014 Australian Invitational Informatics Olympiad. In my opinion it would be a question of moderate difficulty on an IOI exam.
You are given a critical string of length $C$ in some arbitrarily large alphabet $\Sigma$ and some text of length $T$ in $\Sigma$ ($C \leq T \leq 10^6$). You are guaranteed that each character in $\Sigma$ appears at most once in the critical string. You are then instructed to perform $U$ update operations; each operation is of the form "set the character at position $i$ in the text to be $x$" ($0 \leq i < T$, $x \in \Sigma$, $U \leq 10^6$). After each update operation, output the maximum number of consecutive occurrences ("runs") of the critical string in the text. The intended time complexity is linearithmic.

The difficulty in this question arises from the number of different sub-algorithms/techniques (three or four of them) that need to be applied to construct a complete solution. What follows is my solution.
Before we can track runs of the critical word within the text, we first need the ability to detect the formation of individual instances of the critical word. Intuitively, some update operations bring us closer to a critical word; e.g. if the critical word is "badger", then an update operation "booger" $\to$ "bodger" gets us closer to "badger". Conversely, "bodger" $\to$ "booger" gets us further from the formation of the critical word. Specifically, some $C$-length substring of the text gets closer/further if its edit distance to the critical string (here, simply the number of positions at which the two strings differ) decreases/increases respectively.
The following observation drives our first technique.
Observation 1. Each update operation results in a decrease in edit distance for at most one $C$-length substring of the text, and an increase in edit distance for at most one $C$-length substring of the text.
This is a result of the character-distinctness property of the critical string. Say we set a character in the text to 'a'. If the critical string does not contain 'a', then no text substring's edit distance could have decreased. If the critical string contains one occurrence of 'a' at position $p_a$, then only the text substring whose $p_a$th character has just become 'a' has a reduction in edit distance. The critical string cannot contain more than one occurrence of 'a', so we need not consider more cases.
Similarly, if we are replacing the character 'b' with 'a', then an increase in edit distance occurs only for the substring whose $p_b$th character is replaced, or for no substring if 'b' does not appear in the critical string.
Sub-algorithm 1. By maintaining a map of character to critical string position (i.e. $c \to p_c$) we can find which $C$-length substring gets closer (if any) and which $C$-length substring gets further (if any) in constant time.
If we are updating position $i$ in the text from character $c$ to character $d$, the substring that gets closer begins at position $i - p_d$ in the text and the substring that gets further begins at position $i - p_c$ (all positions are 0-based by the way).
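To make sub-algorithm 1 concrete, here is a minimal sketch. The map name `critPos`, the function `affectedWindows`, and the $-1$ sentinel are my own illustrative choices (the full program at the end of this post uses a map called `subtract` for the same job):

#include <map>
#include <utility>

std::map<int, int> critPos; // c -> p_c, built once from the critical string

// For the update "text[i]: c -> d", return the start indices of the window
// that gets closer (first) and the window that gets further (second).
// -1 means the relevant character does not occur in the critical string;
// callers should also discard indices outside 0..T-C.
std::pair<int, int> affectedWindows(int i, int c, int d) {
    int closer  = critPos.count(d) ? i - critPos[d] : -1; // window gaining a match
    int further = critPos.count(c) ? i - critPos[c] : -1; // window losing a match
    return std::make_pair(closer, further);
}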
Sub-algorithm 2. Store, for each position $i \in \{0, \ldots, T-C\}$ in the text, the edit distance of the $C$-length substring beginning at position $i$ in the text. After each update operation, make the zero, one, or two necessary modifications to these values (as indicated by sub-algorithm 1). After replacing the character $c$ at position $i$ with $d$, we can determine if a critical word instance has been formed or un-formed by querying the edit distance at $i-p_d$ and $i-p_c$ respectively. This is all constant time.
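And a minimal sketch of sub-algorithm 2 under the same assumptions, reusing `critPos` from the previous sketch (here `dist[w]` stores the edit distance of the window starting at `w`; the full program equivalently counts matching characters and tests against $C$):

#include <map>
#include <vector>

extern std::map<int, int> critPos; // c -> p_c, as in the previous sketch
std::vector<int> dist; // dist[w] = edit distance of text[w..w+C-1], size T-C+1,
                       // initialised with the true distances before any updates

// Apply "text[i]: c -> d", reporting formation/un-formation events.
void applyUpdate(int i, int c, int d, int T, int C) {
    if (c == d) return; // nothing changes
    if (critPos.count(c)) {
        int w = i - critPos[c];
        if (w >= 0 && w <= T - C && dist[w]++ == 0) {
            // the critical word starting at w was just un-formed
        }
    }
    if (critPos.count(d)) {
        int w = i - critPos[d];
        if (w >= 0 && w <= T - C && --dist[w] == 0) {
            // the critical word starting at w was just formed
        }
    }
}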
Sub-algorithm 2 allows us to detect formations and un-formations of the critical word in the text as we go, in $O(1)$ time for each update. The next step is to detect runs of the critical word in the text. To do this, we introduce a complex data structure into the equation.
I'll begin by posing a more straightforward problem. We have a binary string, on which we perform a series of updates. In each update we set or unset a bit in the string. After each update, we need to output the length of the longest run of 1's in the string.
In fact, we can implement a range tree that handles these exact update and query operations. I won't go into the implementation details, but I will mention what each tree node stores (a sketch of the merge step follows the list):
- The longest run in the interval contained by the node
- The size of the run that is incident to the left side of the node's interval
- The size of the run that is incident to the right side of the node's interval
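The interesting part is how a parent node's three values are derived from its children's. A minimal sketch, with field names of my own choosing (`best`/`pref`/`suf` correspond to the three bullets above; the full program calls them `mid`/`left`/`right`):

#include <algorithm>

struct Node {
    int best; // longest run of 1's anywhere in the node's interval
    int pref; // run incident to the interval's left side
    int suf;  // run incident to the interval's right side
};

// Combine adjacent child intervals of lengths lenL and lenR.
Node merge(const Node &L, const Node &R, int lenL, int lenR) {
    Node out;
    out.best = std::max(std::max(L.best, R.best), L.suf + R.pref); // a run may straddle the middle
    out.pref = (L.pref == lenL) ? lenL + R.pref : L.pref; // left child all 1's: run spills right
    out.suf  = (R.suf == lenR) ? lenR + L.suf : R.suf;    // right child all 1's: run spills left
    return out;
}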
Sub-algorithm 3. Instantiate $C$ longest-run trees, where each represents a different 'offset' of a critical string run within the text. The first 'offset' has the first letter of the critical string at indices $0, C, 2C, \ldots$ within the text, the second 'offset' has the first letter at indices $1, C+1, 2C+1, \ldots$, etc.
Each time we detect the completion of a critical word within the text (using sub-algorithm 2), let the position of the beginning of the critical word be $p_0$. Take the $(p_0 \bmod C)$'th longest-run tree and "set the bit" in the tree at index $\left\lfloor \frac{p_0}{C} \right\rfloor$. Do the same for any critical words that were un-formed -- find their offset's tree, find the position in the tree, and unset the bit.
Now if we query that tree, we discover the longest run of consecutive critical words in that 'offset'. Setting and unsetting bits is $O(\lg T)$ and querying is also $O(\lg T)$.
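In code, the wiring is a one-liner each way. This sketch leans on hypothetical declarations: `RunTree`, `setBit`, and `unsetBit` stand for the longest-run tree and its point updates sketched above, and `trees` is the array of $C$ such trees:

struct RunTree;                      // opaque handle for a longest-run tree
void setBit(RunTree *t, int leaf);   // hypothetical point updates (see above)
void unsetBit(RunTree *t, int leaf);

extern RunTree *trees[]; // one tree per offset, C in total
extern int C;

// A critical word starting at text position p0 lives at leaf p0 / C
// of tree p0 % C.
void onWordFormed(int p0)   { setBit(trees[p0 % C], p0 / C); }
void onWordUnformed(int p0) { unsetBit(trees[p0 % C], p0 / C); }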
We are almost done! After making an update, we can query the longest run for a particular offset of the critical word in logarithmic time. But how do we know if this offset contains the maximum-length run?
For this, we introduce yet another data structure (an easier one this time). We use a max-first priority queue.
Sub-algorithm 4. Initialise a priority queue of integer pairs where each pair is of the form (maximum run-length in tree, tree ID), sorted in descending order by the former. Each time we set or unset a bit in a tree, query the maximum run-length of that tree and push it onto the priority queue together with the tree ID.
Now when it comes time to output the maximum run-length of the critical string, query the priority queue. This will give you the largest recorded run-length across all trees, as well as the tree in which that run-length occurred. However, we can't necessarily trust this value! What if we'd pushed it to the queue a long time ago, and subsequent updates have removed that run of the critical word? To cater for this, we check the tree whose ID sits at the top of the queue: if its current maximum run-length no longer matches the stored value, the entry is stale, so we pop it off and try the next one.
You can also use a set as your priority queue, erasing elements when you unset bits rather than relying on the lazy deletion we're using.
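Here is a minimal sketch of the lazy query, assuming a helper `currentBest(id)` that returns tree `id`'s present maximum run-length (in the full program this is the `mid` field of the tree's root):

#include <queue>
#include <utility>

std::priority_queue<std::pair<int, int> > pq; // (run-length at push time, tree ID)

int currentBest(int treeId); // hypothetical accessor: the tree's root value

int longestRunOverall() {
    // Discard stale entries: their tree has changed since they were pushed.
    while (!pq.empty() && currentBest(pq.top().second) != pq.top().first)
        pq.pop();
    return pq.empty() ? 0 : pq.top().first;
}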
Each time we do an update, we push at most two entries onto the queue, so it can grow to $O(T + U)$ elements; a queue that size has $O(\lg TU)$-time operations. Each time we do an update, we potentially pop several things off the queue, but since each entry is popped at most once, this amortises to a constant number of pops per update. So the time complexity of each update is the time complexity of a constant number of range tree operations followed by an amortised constant number of priority queue operations: $O(\lg T + \lg U) = O(\log TU)$.
Since we do this for each update, we end up with $O(U \log TU)$ as required.
The implementation is not as horrible as it seems! Mine was 110 lines of C++.
#include <cstdio>
#include <cstring>
#include <cstdlib> // for malloc
#include <map>
#include <algorithm>
#include <queue>
#include <cassert>
using namespace std;
#define MAX_N 1000005
#define MAX_W MAX_N
#define ROOT 0
#define LC(n) (2*(n)+1)
#define RC(n) (2*(n)+2)
// One longest-run tree node: the longest run of 1's in the node's interval,
// plus the runs incident to the interval's left and right sides.
struct node {
    int left, right, mid;
};
node *trees[MAX_N];     // one longest-run tree per offset
map<int, int> subtract; // character -> its position in the critical word
int counters[MAX_N];    // counters[i] = characters of text[i..i+W-1] matching the word
int N, W, Q, word[MAX_W], searchStr[MAX_N];
priority_queue<pair<int, int> > maxConsPQ; // <max cons in tree, tree id>
int LAST;  // index of the last leaf in each tree
int total; // number of critical-word occurrences currently in the text
// Set (v = true) or unset the bit at leaf u of tree t, rebuilding the run
// lengths of every node on the path back up to the root.
void put(node t[], int cur, int u, int ns, int ne, bool v) {
    assert(ns <= u && u <= ne);
    if (ns == u && u == ne) {
        t[cur].left = t[cur].right = t[cur].mid = v;
    } else {
        if (u <= (ns+ne) / 2) {
            put(t, LC(cur), u, ns, (ns+ne)/2, v);
        } else {
            put(t, RC(cur), u, (ns+ne)/2 + 1, ne, v);
        }
        int half = (ne - ns + 1) / 2, mid = 0;
        mid = max(mid, t[LC(cur)].right + t[RC(cur)].left); // run straddling the middle
        mid = max(mid, max(t[LC(cur)].mid, t[RC(cur)].mid));
        mid = max(mid, max(t[LC(cur)].left, t[RC(cur)].right));
        t[cur].mid = mid;
        t[cur].left = t[LC(cur)].left;
        if (t[LC(cur)].left == half) { // left child all 1's: run spills into the right child
            t[cur].left += t[RC(cur)].left;
        }
        t[cur].right = t[RC(cur)].right;
        if (t[RC(cur)].right == half) { // right child all 1's: run spills into the left child
            t[cur].right += t[LC(cur)].right;
        }
    }
}
// Apply the update text[pos] = nc, maintaining the counters, the trees, and
// the priority queue. isInitial skips the removal step while the text is
// first being read in.
void change(int pos, int nc, bool isInitial) {
    int oc = searchStr[pos];
    // The window that moves further from the critical word (loses the old character).
    if (!isInitial && subtract.find(oc) != subtract.end() && pos - subtract[oc] >= 0) {
        int po = pos - subtract[oc];
        if (W == counters[po]) { // a critical word was just un-formed
            total--;
            put(trees[po % W], ROOT, po / W, 0, LAST, false);
            maxConsPQ.push(make_pair(trees[po%W][ROOT].mid, po % W));
        }
        counters[po]--;
        assert(counters[po] >= 0);
    }
    searchStr[pos] = nc;
    // The window that moves closer to the critical word (gains the new character).
    if (subtract.find(nc) != subtract.end() && pos - subtract[nc] >= 0) {
        int pn = pos - subtract[nc];
        counters[pn]++;
        assert(counters[pn] <= W);
        if (W == counters[pn]) { // a critical word was just formed
            total++;
            put(trees[pn % W], ROOT, pn / W, 0, LAST, true);
            maxConsPQ.push(make_pair(trees[pn%W][ROOT].mid, pn % W));
        }
    }
}
// Lazily discard stale queue entries, then report the longest run overall.
int query() {
    while (!maxConsPQ.empty() && trees[maxConsPQ.top().second][ROOT].mid != maxConsPQ.top().first) {
        maxConsPQ.pop();
    }
    return maxConsPQ.empty() ? 0 : maxConsPQ.top().first;
}
int main() {
    scanf("%d %d", &W, &N);
    int treeSize = 1;
    while (treeSize < N/W) treeSize <<= 1; // leaves per tree, rounded up to a power of two
    LAST = treeSize - 1;
    for (int w = 0; w < W; w++) {
        scanf("%d", &word[w]);
        trees[w] = (node*) malloc(sizeof(node) * 2*treeSize);
        memset(trees[w], 0, sizeof(node) * 2*treeSize);
        subtract[word[w]] = w;
    }
    for (int n = 0; n < N; n++) {
        scanf("%d", &searchStr[n]);
        change(n, searchStr[n], true); // build the initial counters and trees
    }
    scanf("%d", &Q);
    for (int q = 0; q < Q; q++) {
        int i, x;
        scanf("%d %d", &i, &x);
        change(i-1, x, false); // input positions are 1-based
        printf("%d %d\n", total, query()); // occurrence count and longest run
    }
    return 0;
}