This is an original string manipulation problem that I wrote. It was included on the 2014 Australian Invitational Informatics Olympiad. In my opinion it would be a question of moderate difficulty on an IOI exam.
You are given a critical string of length $C$ in some arbitrarily large alphabet $\Sigma$ and some text of length $T$ in $\Sigma$ ($C \leq T \leq 10^6$). You are guaranteed that each character in $\Sigma$ appears at most once in the critical string. You are then instructed to perform $U$ update operations; each operation is of the form "set the character at position $i$ in the text to be $x$" ($0 \leq i < T, x \in \Sigma, U \leq 10^6$). After each update operation, output the maximum number of consecutive occurrences ("runs") of the critical string in the text. The intended time complexity is linearithmic.
The difficulty in this question arises from the number of different sub-algorithms/techniques (three or four of them) that need to be applied to construct a complete solution. What follows is my solution.
Before we can track runs of the critical word within the
text, we first need the ability to detect the formation of individual instances
of the critical word. Intuitively, some update operations make us closer to a
critical word; e.g. if the critical word is "badger", then an update operation
"booger" $\to$ "bodger" gets us closer to "badger". Inversely, "bodger" $\to$
"booger" gets us further from the formation of the critical word. Specifically,
some $C$-length substring of the text gets closer/further if its edit distance
to the critical string (the number of characters that differ across both
strings) decreases/increases respectively.
The following observation drives our first technique.
Observation 1. Each update operation results in a decrease in edit
distance for at most one $C$-length substring of the text, and an
increase in edit distance for at most one $C$-length substring of the
text.
This is a result of the character-distinctness property of the critical
string. Say we set a character in the text to 'a'. If the critical string does
not contain 'a', then no text substring's edit distance could have decreased.
If the critical string contains one occurrence of 'a' at position $p_a$, then
only the text substring whose $p_a$th character has become 'a' has a reduction
in edit distance. The critical string cannot contain more than one occurrence
of 'a', so we need not consider more cases.
Similarly, if we are replacing the character 'b' with 'a', then an increase
in edit distance occurs only for the substring whose $p_b$th character is
replaced, or for no substrings if 'b' does not appear in the critical string.
Sub-algorithm 1. By maintaining a map of character to critical
string position (i.e. $c \to p_c$) we can find which $C$-length substring gets
closer (if any) and which $C$-length substring gets further (if any) in
constant time.
If we are updating position $i$ in the text from character $c$ to character
$d$, the substring that gets further begins at position $i - p_c$ in the text
and the substring that gets closer begins at position $i - p_d$ (all positions
are 0-based by the way).
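As a self-contained sketch of sub-algorithm 1 (the names here are mine, not the final program's, and characters are modelled as ints to match the arbitrarily large alphabet):

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>
#include <vector>

// Build the character -> critical-string-position map (c -> p_c).
std::unordered_map<int, int> buildPositionMap(const std::vector<int>& critical) {
    std::unordered_map<int, int> pos;
    for (int p = 0; p < (int)critical.size(); p++)
        pos[critical[p]] = p;  // each character appears at most once, so no clashes
    return pos;
}

// For an update text[i]: c -> d, return the start of the substring that gets
// closer (via the new character d) and the one that gets further (via the old
// character c); -1 means "no such substring".
std::pair<int, int> affectedStarts(const std::unordered_map<int, int>& pos,
                                   int i, int c, int d, int T, int C) {
    int closer = -1, further = -1;
    auto itd = pos.find(d);
    if (itd != pos.end() && i - itd->second >= 0 && i - itd->second + C <= T)
        closer = i - itd->second;
    auto itc = pos.find(c);
    if (itc != pos.end() && i - itc->second >= 0 && i - itc->second + C <= T)
        further = i - itc->second;
    return {closer, further};
}
```

The bounds checks also discard substrings that would run off either end of the text.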
Sub-algorithm 2. Store, for each position $i \in \{0, \ldots, T-C\}$ in the
text, the edit distance of the $C$-length substring beginning at position $i$
in the text. After each update operation, make the zero, one, or two necessary
modifications to these values (as indicated by sub-algorithm 1). After
replacing the character $c$ at position $i$ with $d$, we can determine in
constant time if a critical word instance has been formed or un-formed by
querying the edit distances at $i-p_d$ and $i-p_c$ respectively.
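To make sub-algorithm 2 concrete, here is a small self-contained sketch. It tracks the complement quantity, a matched-character count per start position (a count of $C$ means edit distance zero), which is what the full program at the end also does; the names are mine:

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>
#include <vector>

// matched[s] counts positions of the C-length substring starting at s that
// agree with the critical string; matched[s] == C means a critical-word
// instance currently starts at s.
struct Detector {
    std::vector<int> text, matched;
    std::unordered_map<int, int> pos;  // critical character -> position
    int C;

    Detector(const std::vector<int>& critical, const std::vector<int>& t)
        : text(t), matched(t.size(), 0), C((int)critical.size()) {
        for (int p = 0; p < C; p++) pos[critical[p]] = p;
        for (int i = 0; i < (int)t.size(); i++) bump(i, t[i], +1);
    }

    // Adjust the counter of the (at most one) substring affected via character ch.
    void bump(int i, int ch, int delta) {
        auto it = pos.find(ch);
        if (it == pos.end()) return;
        int s = i - it->second;
        if (s >= 0 && s + C <= (int)text.size()) matched[s] += delta;
    }

    // Returns (start that just formed a word, start that just un-formed one);
    // -1 means none.
    std::pair<int, int> update(int i, int d) {
        int c = text[i];
        if (c == d) return {-1, -1};
        bump(i, c, -1);
        text[i] = d;
        bump(i, d, +1);
        int formed = -1, unformed = -1;
        auto itd = pos.find(d), itc = pos.find(c);
        if (itd != pos.end()) {
            int s = i - itd->second;
            if (s >= 0 && s + C <= (int)text.size() && matched[s] == C) formed = s;
        }
        if (itc != pos.end()) {
            int s = i - itc->second;
            if (s >= 0 && s + C <= (int)text.size() && matched[s] == C - 1) unformed = s;
        }
        return {formed, unformed};
    }
};
```

Each `update` does a constant number of hash-map lookups, so it runs in expected constant time.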
Sub-algorithm 2 allows us to detect formations and un-formations of the
critical word in the text as we go, in $O(1)$ time for each update. The next
step is to detect runs of the critical word in the text. To do this, we
introduce a complex data structure into the equation.
I'll begin by posing a more straightforward problem. We have a binary
string. We then perform a series of updates on the string. In each update we
set or unset a bit in the string. After each update, we need to output the
length of the longest run of 1's in the string.
In fact, we can implement a range tree that handles these exact update and
query operations. I won't go into the implementation details, but I will
mention what each tree node stores:
- The longest run in the interval contained by the node
- The size of the run that is incident to the left side of the node's interval
- The size of the run that is incident to the right side of the node's interval
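The recombination rule for these three fields can be sketched as follows. I store the interval size explicitly for clarity (an assumption of this sketch; the program at the end derives it from the recursion instead):

```cpp
#include <algorithm>
#include <cassert>

// Sketch of one longest-run tree node over a binary string interval.
struct Node {
    int left;   // length of the run of 1's touching the interval's left end
    int right;  // length of the run of 1's touching the interval's right end
    int mid;    // length of the longest run of 1's anywhere in the interval
    int size;   // interval length
};

Node combine(const Node& a, const Node& b) {
    Node n;
    n.size = a.size + b.size;
    // The longest run lies entirely in a, entirely in b, or straddles the boundary.
    n.mid = std::max({a.mid, b.mid, a.right + b.left});
    // The left-incident run spills into b only if a is all 1's.
    n.left = (a.left == a.size) ? a.size + b.left : a.left;
    // Symmetrically for the right-incident run.
    n.right = (b.right == b.size) ? b.size + a.right : b.right;
    return n;
}
```

A set or unset at a leaf then just recombines the nodes on the leaf-to-root path.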
We'll call this the
longest-run tree. Using this data structure (in
fact, $C$ different instances of it!), we get our third technique:
Sub-algorithm 3. Instantiate $C$ longest-run trees, where each
represents a different 'offset' of a critical string run within the text. The
first 'offset' has the first letter of the critical string at indices $0, C,
2C, \ldots$ within the text, the second 'offset' has the first letter at indices
$1, C+1, 2C+1, \ldots$, etc.
Each time we detect the completion of a critical word within the text
(using sub-algorithm 2), let the position of the beginning of the critical word
be $p_0$. Take the $(p_0 \bmod C)$'th longest-run tree and "set the bit" in the
tree at index $\left\lfloor \frac{p_0}{C} \right\rfloor$. Do the same for any critical words that we
unformed -- find their offset's tree, find the position in the tree, and unset
the bit.
Now if we query that tree, we discover the longest run of consecutive
critical words
in that 'offset'. Setting and unsetting bits is $O(\lg
T)$ and querying is also $O(\lg T)$.
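The bookkeeping that routes a detected word into the right tree is a single divmod; as a tiny illustrative helper (the name is mine):

```cpp
#include <cassert>
#include <utility>

// Map the start position p0 of a detected critical word of length C to the
// (tree id, bit index) pair used by sub-algorithm 3.
std::pair<int, int> locateBit(int p0, int C) {
    return { p0 % C, p0 / C };
}
```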
We are almost done! After making an update, we can query the longest run
for a particular offset of the critical word in logarithmic time. But how do we
know if this offset contains the maximum-length run?
For this, we introduce yet another data structure (an easier one this
time). We use a max-first priority queue.
Sub-algorithm 4. Initialise a priority queue of integer pairs where
each pair is of the form (maximum run-length in tree, tree ID), sorted in
descending order by the former. Each time we "set a bit" or "unset a bit" in a
tree, query the maximum run-length of that tree. Push onto the priority queue
that maximum run-length as well as the tree ID.
Now when it comes time to output the maximum run-length of the critical
string, query the priority queue. This will give you the maximum run-length
across all trees, as well as the tree in which that maximum run-length occurs.
However, we can't necessarily trust this value! What if we'd pushed it to the
queue a long time ago, and subsequent updates have removed that run of the
critical word? To cater for this, we must now query the tree whose ID sits at
the top of the queue and check that its current maximum run length really
matches the queued value. If it doesn't, the element is stale: we ignore it,
pop it off, and try the next one.
You can also use a set as your priority queue, and erase elements when you
unset bits rather than the lazy deletion we're using.
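The lazy-deletion discipline can be isolated into a small sketch. The current-value lookup is abstracted behind a `std::function` (an assumption of this sketch; the real program reads the tree root directly):

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Max-first queue of (run length, tree id) pairs with lazy deletion: stale
// entries are discarded at query time by re-checking the current value.
struct LazyMaxPQ {
    std::priority_queue<std::pair<int, int>> pq;
    std::function<int(int)> current;  // tree id -> its current max run length

    void push(int value, int id) { pq.push({value, id}); }

    int queryMax() {
        while (!pq.empty() && current(pq.top().second) != pq.top().first)
            pq.pop();  // the tree changed since this entry was pushed
        return pq.empty() ? 0 : pq.top().first;
    }
};
```

Because a fresh pair is pushed after every bit change, the top valid entry always reflects some tree's true current maximum.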
Each time we do an update, we push at most two things to the queue (one for a
set bit, one for an unset bit), so after $U$ updates the queue holds $O(U)$
elements; a queue of that size has $O(\lg U)$-time operations. Each time we do
an update, we potentially pop several things off the queue, but since the
total number of elements ever pushed is bounded, this amortizes to a constant
number of pops per update. So the time complexity of each update is the time
complexity of a constant number of range tree operations followed by an
amortized constant number of priority queue operations: $O(\lg T + \lg U) =
O(\log TU)$.
Since we do this for each update, we end up with $O(U \log TU)$ as required.
The implementation is not as horrible as it seems! Mine was
110 lines of C++.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <map>
#include <algorithm>
#include <queue>
#include <cassert>
using namespace std;

#define MAX_N 1000005
#define MAX_W MAX_N
#define ROOT 0
#define LC(n) (2*(n)+1)
#define RC(n) (2*(n)+2)

// In the code, W is the critical-string length C, N is the text length T,
// and Q is the number of updates U.
struct node {
    int left, right, mid;
};
node *trees[MAX_N];                        // one longest-run tree per offset
map<int, int> subtract;                    // critical character -> its position p_c
int counters[MAX_N];                       // matched-character count per start position
int N, W, Q, word[MAX_W], searchStr[MAX_N];
priority_queue<pair<int, int> > maxConsPQ; // <max cons in tree, tree id>
int LAST;
int total;

void put(node t[], int cur, int u, int ns, int ne, bool v) {
    assert(ns <= u && u <= ne);
    if (ns == u && u == ne) {
        t[cur].left = t[cur].right = t[cur].mid = v;
    } else {
        if (u <= (ns+ne) / 2) {
            put(t, LC(cur), u, ns, (ns+ne)/2, v);
        } else {
            put(t, RC(cur), u, (ns+ne)/2 + 1, ne, v);
        }
        // Recombine: the longest run is within a child or straddles the boundary.
        int half = (ne - ns + 1) / 2, mid = 0;
        mid = max(mid, t[LC(cur)].right + t[RC(cur)].left);
        mid = max(mid, max(t[LC(cur)].mid, t[RC(cur)].mid));
        mid = max(mid, max(t[LC(cur)].left, t[RC(cur)].right));
        t[cur].mid = mid;
        t[cur].left = t[LC(cur)].left;
        if (t[LC(cur)].left == half) {
            t[cur].left += t[RC(cur)].left;
        }
        t[cur].right = t[RC(cur)].right;
        if (t[RC(cur)].right == half) {
            t[cur].right += t[LC(cur)].right;
        }
    }
}

void change(int pos, int nc, bool isInitial) {
    int oc = searchStr[pos];
    // The substring through the old character gets further from the critical word.
    if (!isInitial && subtract.find(oc) != subtract.end() && pos - subtract[oc] >= 0) {
        int po = pos - subtract[oc];
        if (W == counters[po]) {
            total--;
            put(trees[po % W], ROOT, po / W, 0, LAST, false);
            maxConsPQ.push(make_pair(trees[po%W][ROOT].mid, po % W));
        }
        counters[po]--;
        assert(counters[po] >= 0);
    }
    searchStr[pos] = nc;
    // The substring through the new character gets closer.
    if (subtract.find(nc) != subtract.end() && pos - subtract[nc] >= 0) {
        int pn = pos - subtract[nc];
        counters[pn]++;
        assert(counters[pn] <= W);
        if (W == counters[pn]) {
            total++;
            put(trees[pn % W], ROOT, pn / W, 0, LAST, true);
            maxConsPQ.push(make_pair(trees[pn%W][ROOT].mid, pn % W));
        }
    }
}

int query() {
    // Lazy deletion: drop stale entries whose tree has since changed.
    while (!maxConsPQ.empty() && trees[maxConsPQ.top().second][ROOT].mid != maxConsPQ.top().first) {
        maxConsPQ.pop();
    }
    return maxConsPQ.empty() ? 0 : maxConsPQ.top().first;
}

int main() {
    scanf("%d %d", &W, &N);
    int treeSize = 1;
    while (treeSize < N/W) treeSize <<= 1;
    LAST = treeSize - 1;
    for (int w = 0; w < W; w++) {
        scanf("%d", &word[w]);
        trees[w] = (node*) malloc(sizeof(node) * 2*treeSize);
        memset(trees[w], 0, sizeof(node) * 2*treeSize);
        subtract[word[w]] = w;
    }
    for (int n = 0; n < N; n++) {
        scanf("%d", &searchStr[n]);
        change(n, searchStr[n], true);
    }
    scanf("%d", &Q);
    for (int q = 0; q < Q; q++) {
        int i, x;
        scanf("%d %d", &i, &x);
        change(i-1, x, false);
        printf("%d %d\n", total, query());
    }
    return 0;
}