Sunday, July 27, 2014

Don't bound random integers with %

This brief remark will be the first in a series of posts about random number generation. Rather than discussing generation processes themselves (I'm nowhere near qualified enough to do that), I'll cover some notions of randomness, tests for randomness, and the role of those tests in software engineering.

You want to generate a random integer between 0 and $n$ inclusive. The usual advice is to do this:

num = random() % (n+1)

Here random() is some standard library function that outputs uniformly distributed random integers. This works, and for most purposes it works well. But when an exactly uniform distribution is required, it's dangerous.

The problem is that random() outputs an integer less than some specific maximum. This maximum varies based on your programming language, so we'll refer to it as RAND_MAX – the name of the macro in C. On $k$-bit machines, RAND_MAX is often $2^{k-1} - 1$.

Assume for a second that RAND_MAX = 7, and you want a random number between 0 and 2 inclusive. You do num = random() % 3. Which outcomes of random() give num = 0? 0, 3, 6. What about num = 1? 1, 4, 7. What about num = 2? 2 and 5. Do you see the problem? Recall that the eight possible outputs of random() are uniformly distributed, so the probability that num = 0 is $\frac38$, the probability that num = 1 is $\frac38$, but the probability that num = 2 is only $\frac28$. num is thus not uniformly distributed.
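
You can verify these counts by enumerating all eight outputs of this toy random() (plain Python):

from collections import Counter

print(Counter(r % 3 for r in range(8)))  # Counter({0: 3, 1: 3, 2: 2})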

Obviously this effect is far less noticeable for larger RAND_MAX values, but it's present nonetheless. In cases where precision is a high priority, this bias is concerning.

So what's the right way? We can do:

do { num = random() } while (num > n)

but the expected number of iterations of this loop is $\frac{\mathrm{RAND\_MAX}+1}{n+1}$, which is enormous for small $n$. Better to create a value $n'$ that's the lowest power of 2 strictly greater than $n$, and instead do:

do { num = random() % n' } while (num > n)

This solution takes advantage of the fact that RAND_MAX is one less than a power of 2, so random() % n' is uniformly distributed over $0, \dots, n'-1$. For instance, in our previous example $n' = 4$, so num = random() % n' gives num = 0 when random() is 0 or 4, num = 1 when random() is 1 or 5, num = 2 when random() is 2 or 6, and num = 3 when random() is 3 or 7. This means that each possible outcome of num occurs with equal probability! However, if num = 3, we have to iterate again. Fortunately $n' \leq 2n$, so the expected number of iterations, $\frac{n'}{n+1}$, is less than 2 – a fair compromise for true uniform randomness.
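
Here's a minimal Python sketch of this approach; random.getrandbits stands in for the post's random(), since taking $k$ uniform bits gives a uniform value below $n' = 2^k$:

import random

def bounded_rand(n):
    # uniformly random integer in [0, n], by rejection sampling
    k = max(1, n.bit_length())       # n' = 2**k is the lowest power of 2 > n
    while True:
        num = random.getrandbits(k)  # uniform over 0 .. n'-1
        if num <= n:                 # accept; expected iterations < 2
            return num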

A final word: I assume that your implementation of random() yields uniformly random integers. Standard library implementations of random() often do not satisfy this. To ensure uniform randomness, use a good random number generator.

In a future post, I'll discuss how to test the conformity of a black-box random number generator to the probability distribution it claims to choose from.

Tuesday, May 27, 2014

Counting Integers with a Particular Binary Property

The Hamming weight or popcount of $n$ is the number of 1s in the binary representation of $n$.

Question: how many positive integers less than $N$ have a popcount divisible by 3?

The naive $O(N \log N)$ solution is to loop through each integer $n < N$ and count the number of bits that are set. This is slow.

Don't fret! There is an $O(\log^2 N)$ solution.

Suppose that $N$ is a power of 2. Then its binary representation is 1 followed by $\lg(N)$ 0s. We can count the number of positive $n < N$ with $\mathrm{popcount}(n)$ divisible by 3 by simply computing $C(\lg(N), 3) + C(\lg(N), 6) + ... + C(\lg(N), 3k)$ while $3k \leq \lg(N)$. For instance, when $N=16$, $C(4,3) = 4$, which is correct.
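
A quick check of that count (plain Python, using bin() to compute popcounts):

print([n for n in range(1, 16) if bin(n).count("1") % 3 == 0])  # [7, 11, 13, 14]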

Next, suppose that $\mathrm{popcount}(N) = 2$ (i.e. $N$ has exactly two bits set, i.e. $N$ is the sum of a pair of distinct powers of 2). E.g. say $N = 20$ (binary 10100). The computation that worked for $N=16$ still counts everything below 16. Now in a sense we've "considered" that power of 2. Then consider the next power of 2 that makes up the number, 4. The bit for 4 also has two 0s following, so we use the same principle as before! However, we remember that the bit representing 16 has already been set, so we're no longer looking for popcounts that are multiples of 3; instead, we're looking for popcounts which give a remainder of 2 modulo 3. Mathematically, we do $C(\lg(4), 2) + ... + C(\lg(4), 3k-1)$ while $3k-1 \leq \lg(4)$. In this case, we have only $C(2,2) = 1$. This corresponds to $n = 19$ (binary 10011). So for $N=20$, the answer is $4+1=5$.

OK now we generalise the pattern.

from euler import popcount, binom

def solve(N):
    # Count positive n < N whose popcount is divisible by 3.
    ans = 0
    num_considered = 0  # set bits of N already fixed, to the left of p
    p = 2**60           # above the highest bit we'll ever need
    while p >= 1:
        if N & p:
            # n agreeing with N above p, with bit p set, all lower bits 0
            if (num_considered + 1) % 3 == 0:
                ans += 1
            # n agreeing with N above p, with bit p clear: choose mult of
            # the lg(p) free lower bits so the total popcount is a
            # multiple of 3
            lg = p.bit_length() - 1  # integer log2(p)
            mult = 3 - (num_considered % 3)
            while mult <= lg:
                ans += binom(lg, mult)
                mult += 3
            num_considered += 1
        p //= 2
    # the loop counts 1 <= n <= N; exclude N itself for a strict bound
    if popcount(N) % 3 == 0:
        ans -= 1
    return ans
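
As a quick sanity check (assuming the euler helpers are importable for solve above), we can compare against the naive method on small inputs:

def solve_naive(N):
    return sum(1 for n in range(1, N) if bin(n).count("1") % 3 == 0)

assert all(solve(N) == solve_naive(N) for N in range(1, 1024))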

I'm not even sure that this is $O(\log^2 N)$... there are two nested loops that each run $O(\log N)$ times, and inside the inner loop we call a binomial function which in theory is $O(nk)$; we call it with $n, k \leq \log N$, which would contribute an extra $O(\log^2 N)$ factor. But given the ranges of arguments we actually pass it, I'm pretty sure each call is amortized constant time. Maybe.

euler is my Project Euler helper-functions module, which is available on my GitHub.

Thursday, May 8, 2014

P vs NP for Dummies

I seem to type up a variation of this post every few months. I'm saving this one for the permanent record.


P vs NP for Dummies


There are some yes/no problems which are easy. If I give you a list of numbers and ask you to tell me if the number 5 is in that list, you can solve it easily by looking at each number in the list. In the worst case, when the number 5 is at the very end of the list, this takes about $n$ steps/instructions.
There are yes/no problems which are slightly harder -- for example, I might ask you to tell me if 12*14=168. When you multiply two $n$-digit numbers using the method you learnt in school, you need about $n^2$ steps.
Both of these problems require a number of steps that's bounded by a polynomial of the input size; i.e. for each of these problems there is a fixed $k$ so that the number of steps required to solve an instance of size $n$ is always bounded by $n^k$. This class of problems is called P, which is a terrible pseudo-abbreviation for 'polynomial time on a deterministic Turing machine' (a deterministic Turing machine is a standard computer, more or less).
Here's another yes/no problem: given a map of a country with $n$ cities, determine whether there's a route that can be taken that visits every city exactly once, and whose total distance is less than a given distance $d$. If we are told that the answer is yes and we are given the corresponding route, we can easily check its total distance by simply following the route and adding up all the distances between cities. This verification process takes about $n$ steps, since you're adding $n-1$ numbers together. The set of yes/no problems whose solutions can be verified in polynomial time (given some extra information, in this case the route itself) is called NP. This particular problem is NP-complete, meaning that it is in a sense 'harder' than every other problem in NP.
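
As a sketch of what that verification step might look like (hypothetical representation: dist[a][b] is the distance between cities a and b, and the route is a list of city indices):

def verify_route(route, dist, d):
    n = len(dist)
    visits_each_once = sorted(route) == list(range(n))
    total = sum(dist[route[i]][route[i + 1]] for i in range(n - 1))
    return visits_each_once and total < d

Every step is a simple pass over the route, so the whole check is polynomial in $n$.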
One might conjecture that, because you can verify a solution to this problem in a polynomial number of steps, perhaps you can also solve it from scratch in polynomial time. Note that all problems in P are also in NP: we can verify a solution to any problem in P in polynomial time by simply solving that problem ourselves and checking the result against the given solution. Since the route problem is 'harder' than all the other NP problems, if we can solve this one in polynomial time, then we can solve all the other NP problems in polynomial time too! This would mean that all NP problems are also P problems; in other words, P = NP.
So you set out to try to find a polynomial time method to produce a solution to this problem. But you can't. No one can. We don't have a polynomial time method to solve this problem. In fact we don't have a polynomial time method for any of the hundreds of known NP-complete problems. These problems range from finding routes across maps, to packing items efficiently into a backpack, to optimally playing Candy Crush. On the flip side though, no one has managed to prove that a polynomial time method doesn't exist. Such a proof would demonstrate that P =/= NP.
So there you are. The problem of P vs NP is the problem of proving either that P=NP (this can be done simply by finding a polynomial time method to solve the problem above) or that P =/= NP (this intuitively seems harder to prove).
For the route-finding problem described above (it's called the travelling salesman problem), the best currently known method is a small optimisation on "brute force every permutation of cities to visit". This method bounds the number of steps by about $n^2 \cdot 2^n$. If you can get rid of the exponential term in that expression, you get a million dollar prize and immediately render most of the field of cryptography obsolete.

Sunday, February 16, 2014

An Original String Manipulation Problem

This is an original string manipulation problem that I wrote. It was included on the 2014 Australian Invitational Informatics Olympiad. In my opinion it would be a question of moderate difficulty on an IOI exam.
You are given a critical string of length $C$ in some arbitrarily large alphabet $\Sigma$ and some text of length $T$ in $\Sigma$ ($C \leq T \leq 10^6$). You are guaranteed that each character in $\Sigma$ appears at most once in the critical string. You are then instructed to perform $U$ update operations; each operation is of the form "set the character at position $i$ in the text to be $x$" ($0 \leq i < T$, $x \in \Sigma$, $U \leq 10^6$). After each update operation, output the maximum number of consecutive occurrences ("runs") of the critical string in the text. The intended time complexity is linearithmic.
The difficulty in this question arises from the number of different sub-algorithms/techniques (three or four of them) that need to be applied to construct a complete solution. What follows is my solution.
Before we can track runs of the critical word within the text, we first need the ability to detect the formation of individual instances of the critical word. Intuitively, some update operations move us closer to a critical word; e.g. if the critical word is "badger", then an update operation "booger" $\to$ "bodger" gets us closer to "badger". Conversely, "bodger" $\to$ "booger" gets us further from the formation of the critical word. Specifically, some $C$-length substring of the text gets closer/further if its edit distance to the critical string (here, the number of positions at which the two strings differ; strictly speaking this is the Hamming distance) decreases/increases respectively.
The following observation drives our first technique.

Observation 1. Each update operation decreases the edit distance of at most one $C$-length substring of the text, and increases the edit distance of at most one $C$-length substring of the text.
This is a result of the character-distinctness property of the critical string. Say we set a character in the text to 'a'. If the critical string does not contain 'a', then no text substring's edit distance could have decreased. If the critical string contains one occurrence of 'a' at position $p_a$, then only the text substring whose $p_a$th character has become 'a' has a reduction in edit distance. The critical string cannot contain more than one occurrence of 'a', so we need not consider more cases.
Similarly, if we are replacing the character 'b' with 'a', then an increase in edit distance occurs only for the substring whose $p_b$th character is replaced, or for no substrings if 'b' does not appear in the critical string.

Sub-algorithm 1. By maintaining a map of character to critical string position (i.e. $c \to p_c$) we can find which $C$-length substring gets closer (if any) and which $C$-length substring gets further (if any) in constant time.
If we are updating position $i$ in the text from character $c$ to character $d$, the substring that gets closer begins at position $i - p_d$ in the text and the substring that gets further begins at position $i - p_c$ (all positions are 0-based by the way).
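
A minimal sketch of sub-algorithm 1 (Python; the names are hypothetical, and critical is the critical string):

crit_pos = {c: p for p, c in enumerate(critical)}  # c -> p_c; well-defined since characters are distinct

def affected_substrings(i, old, new):
    # start indices of the substring that gets closer and the one that
    # gets further (None if the character isn't in the critical string);
    # callers must still check these fall within 0..T-C
    closer = i - crit_pos[new] if new in crit_pos else None
    further = i - crit_pos[old] if old in crit_pos else None
    return closer, further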

Sub-algorithm 2. Store, for each position $i \in \{0, \dots, T-C\}$ in the text, the edit distance of the $C$-length substring beginning at position $i$ in the text. After each update operation, make the zero, one, or two necessary modifications to these values (as indicated by sub-algorithm 1). After replacing the character $c$ at position $i$ with $d$, we can determine in constant time if a critical word instance has been formed or un-formed by querying the edit distance at $i-p_d$ and $i-p_c$ respectively. This is all constant time.
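
Continuing the hypothetical names from the previous sketch (text is a mutable list of characters, dist the per-position mismatch counts), sub-algorithm 2 might look like:

def apply_update(i, new):
    closer, further = affected_substrings(i, text[i], new)
    formed = unformed = None
    if further is not None and 0 <= further <= len(text) - C:
        if dist[further] == 0:
            unformed = further  # an existing instance just broke
        dist[further] += 1
    if closer is not None and 0 <= closer <= len(text) - C:
        dist[closer] -= 1
        if dist[closer] == 0:
            formed = closer     # a new instance just formed
    text[i] = new
    return formed, unformed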

Sub-algorithm 2 allows us to detect formations and un-formations of the critical word in the text as we go, in $O(1)$ time for each update. The next step is to detect runs of the critical word in the text. To do this, we introduce a more complex data structure.
I'll begin by posing a more straightforward problem. We have a binary string. We then perform a series of updates on the string. In each update we set or unset a bit in the string. After each update, we need to output the length of the longest run of 1s in the string.
In fact, we can implement a range tree that handles these exact update and query operations. I won't go into the implementation details, but I will mention what each tree node stores (a sketch of the merge step follows the list):
  • The longest run in the interval contained by the node
  • The size of the run that is incident to the left side of the node's interval
  • The size of the run that is incident to the right side of the node's interval
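
Here's a minimal sketch of that merge step in Python (names hypothetical; half is the number of positions each child covers):

from collections import namedtuple

Node = namedtuple("Node", "longest left right")

def merge(lc, rc, half):
    # a run lies inside either child, or straddles the boundary
    longest = max(lc.longest, rc.longest, lc.right + rc.left)
    # an incident run extends into the sibling only if its child is all 1s
    left = lc.left + (rc.left if lc.left == half else 0)
    right = rc.right + (lc.right if rc.right == half else 0)
    return Node(longest, left, right)
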
We'll call this the longest-run tree. Using this data structure (in fact, $C$ different instances of it!), we get our third technique:

Sub-algorithm 3. Instantiate $C$ longest-run trees, where each represents a different 'offset' of a critical string run within the text. The first 'offset' has the first letter of the critical string at indices 0, $C, 2C, ...$ within the text, the second 'offset' has the first letter at indices $1, C+1, 2C+1, ...$, etc.
Each time we detect the completion of a critical word within the text (using sub-algorithm 2), let the position of the beginning of the critical word be $p_0$. Take the $(p_0 \bmod C)$'th longest-run tree and "set the bit" in the tree at index $\left\lfloor \frac{p_0}{C} \right\rfloor$. Do the same for any critical words that were unformed -- find their offset's tree, find the position in the tree, and unset the bit.
Now if we query that tree, we discover the longest run of consecutive critical words in that 'offset'. Setting and unsetting bits is $O(\lg T)$ and querying is also $O(\lg T)$.
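
For example, with $C = 3$ a critical word formed at $p_0 = 7$ lands in tree $7 \bmod 3 = 1$ at leaf $\lfloor 7/3 \rfloor = 2$. As a sketch (hypothetical names; set_bit is the longest-run tree's update operation):

def set_word(p0, present):
    # tree = offset of the run; leaf = index of this word within the run
    longest_run_trees[p0 % C].set_bit(p0 // C, present)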

We are almost done! After making an update, we can query the longest run for a particular offset of the critical word in logarithmic time. But how do we know if this offset contains the maximum-length run?
For this, we introduce yet another data structure (an easier one this time). We use a max-first priority queue.

Sub-algorithm 4. Initialise a priority queue of integer pairs where each pair is of the form (maximum run-length in tree, tree ID), sorted in descending order by the former. Each time we "set a bit" or "unset a bit" in a tree, query the maximum run-length of that tree and push it onto the priority queue along with the tree ID.
Now when it comes time to output the maximum run-length of the critical string, query the priority queue. This will give you the largest run-length ever pushed, as well as the tree in which it occurred. However, we can't necessarily trust this value! What if we pushed it to the queue a long time ago, and subsequent updates have destroyed that run of the critical word? To cater for this, we query the tree whose ID sits at the front of the queue to check that its current maximum run-length still matches the queued value. If it doesn't, the element is stale: we pop it off and try the next one.
You can also use a set as your priority queue, and erase elements when you unset bits rather than the lazy deletion we're using.
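
A minimal sketch of sub-algorithm 4 with lazy deletion (Python's heapq is a min-heap, so run lengths are negated; longest_run(t) stands for a root query of tree t):

import heapq

pq = []  # entries are (-run_length, tree_id)

def push_max(tree_id):
    heapq.heappush(pq, (-longest_run(tree_id), tree_id))

def query_max():
    while pq:
        neg_run, tree_id = pq[0]
        if longest_run(tree_id) == -neg_run:
            return -neg_run   # front entry is still fresh
        heapq.heappop(pq)     # stale: discard it and try the next one
    return 0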

Each time we do an update, we push at most two things to the queue (one for a formed word, one for an unformed one), so after the initial build and $U$ updates the queue holds $O(T + U)$ elements, and its operations take $O(\lg(T+U))$ time. Each update also potentially pops several things off the queue, but since the total number of pushes is bounded, this amortizes to a constant number of pops per update. So the time complexity of each update is that of a constant number of range tree operations followed by an amortized constant number of priority queue operations: $O(\lg T + \lg U) = O(\log TU)$.
Since we do this for each update, we end up with $O(U \log TU)$ as required.

The implementation is not as horrible as it seems! Mine was 110 lines of C++.
#include <cstdio>
#include <cstring>
#include <cstdlib> // for malloc
#include <map>
#include <algorithm>
#include <queue>
#include <cassert>
using namespace std;
#define MAX_N 1000005
#define MAX_W MAX_N
#define ROOT 0
#define LC(n) (2*(n)+1)
#define RC(n) (2*(n)+2)

// A longest-run tree node: mid is the longest run of 1s in the node's
// interval; left/right are the run lengths incident to each end.
struct node {
    int left, right, mid;
};

node *trees[MAX_N];     // one longest-run tree per offset (sub-algorithm 3)
map<int, int> subtract; // character -> its position in the critical word
int counters[MAX_N];    // counters[i] = critical-word characters matched at text position i
int N, W, Q, word[MAX_W], searchStr[MAX_N];
priority_queue<pair<int, int> > maxConsPQ; // <max cons in tree, tree id>
int LAST;
int total;

// Set leaf u of longest-run tree t to v, then rebuild the run statistics
// (longest run, and the runs incident to each end) on the way back up.
void put(node t[], int cur, int u, int ns, int ne, bool v) {
    assert(ns <= u && u <= ne);
    if (ns == u && u == ne) {
        t[cur].left = t[cur].right = t[cur].mid = v;
    } else {
        if (u <= (ns+ne) / 2) {
            put(t, LC(cur), u, ns, (ns+ne)/2, v);
        } else {
            put(t, RC(cur), u, (ns + ne)/2 + 1, ne, v);
        }
        int half = (ne - ns + 1) / 2, mid = 0;
        mid = max(mid, t[LC(cur)].right + t[RC(cur)].left);
        mid = max(mid, max(t[LC(cur)].mid, t[RC(cur)].mid));
        mid = max(mid, max(t[LC(cur)].left, t[RC(cur)].right));
        t[cur].mid = mid;
        t[cur].left = t[LC(cur)].left;
        if (t[LC(cur)].left == half) {
            t[cur].left += t[RC(cur)].left;
        }
        t[cur].right = t[RC(cur)].right;
        if (t[RC(cur)].right == half) {
            t[cur].right += t[LC(cur)].right;
        }
    }
}

// Apply the update text[pos] := nc, maintaining the per-position match
// counters and the longest-run trees (sub-algorithms 1-3).
void change(int pos, int nc, bool isInitial) {
    int oc = searchStr[pos];
    if (!isInitial && subtract.find(oc) != subtract.end() && pos - subtract[oc] >= 0) {
        int po = pos - subtract[oc];
        if (W == counters[po]) {
            total--;
            put(trees[po % W], ROOT, po / W, 0, LAST, false);
            maxConsPQ.push(make_pair(trees[po%W][ROOT].mid, po % W));
        }
        counters[po]--;
        assert(counters[po] >= 0);
    }
    searchStr[pos] = nc;
    if (subtract.find(nc) != subtract.end() && pos - subtract[nc] >= 0) {
        int pn = pos - subtract[nc];
        counters[pn] ++;
        assert(counters[pn] <= W);
        if (W == counters[pn]) {
            total++;
            put(trees[pn % W], ROOT, pn / W, 0, LAST, true);
            maxConsPQ.push(make_pair(trees[pn%W][ROOT].mid, pn % W));
        }
    }
}

// Lazily discard stale maxima until the front of the queue agrees with
// its tree's current longest run (sub-algorithm 4).
int query() {
    while (!maxConsPQ.empty() && trees[maxConsPQ.top().second][ROOT].mid != maxConsPQ.top().first) {
        maxConsPQ.pop();
    }
    return maxConsPQ.empty() ? 0 : maxConsPQ.top().first;
}

int main() {
    scanf("%d %d", &W, &N);
    int treeSize = 1;
    while (treeSize < N/W) treeSize <<= 1;
    LAST = treeSize - 1;

    for (int w = 0; w < W; w++) {
        scanf("%d", &word[w]);
        trees[w] = (node*) malloc(sizeof(node) * 2*treeSize);
        memset(trees[w], 0, sizeof(node) * 2*treeSize);
        subtract[word[w]] = w;
    }
    for (int n = 0; n < N; n++) {
        scanf("%d", &searchStr[n]);
        change(n, searchStr[n], true);
    }

    scanf("%d", &Q);
    for (int q = 0; q < Q; q++) {
        int i, x;
        scanf("%d %d", &i, &x);
        change(i-1, x, false);
        printf("%d %d\n", total, query());
    }

    return 0;
}

Thursday, January 30, 2014

NFA to DFA

I've started reading a theory of computation textbook in my spare time. In the first few chapters, I relearnt the algorithm for converting a non-deterministic finite automaton (NFA) to a deterministic finite automaton (DFA) that recognises the same language. I say "relearnt" because I actually learnt the same algorithm 2 years ago at the National Computer Science Summer School.

2 years ago, implementing the algorithm was one of the most frustrating (to debug and to reason about) programming tasks I'd ever done. If you're not familiar with the algorithm, it essentially involves breadth-first searching over a graph -- except it's more like a metagraph, because each node represents a subset of the states (i.e. nodes) of the NFA (which is itself a graph). So in this metagraph, $A \xrightarrow{c} B$ iff $B$'s state subset is the union of all the states reachable by following a $c$-transition from any of the states in $A$'s state subset. There's some extra stuff to handle $\epsilon$-transitions too.

Debugging this algorithm was as hard as it sounds. Here's the code from that troublesome day.

EPS = '~'

class NFANode(object):
    def __init__(self, is_final):
        self.is_final = is_final
        self.edges = set()

    def add(self, toks, next):
        for t in toks:
            self.edges.add((t, next))

    def EC(self, seen=None):
        # epsilon closure of this state; `seen` guards against
        # epsilon-cycles, which would recurse forever otherwise
        if seen is None:
            seen = set()
        seen.add(self)
        for tok, next in self.edges:
            if tok == EPS and next not in seen:
                next.EC(seen)
        return seen


class DFANode(object):
    def __init__(self):
        self.edges = []
        self.states = set()
        
    def is_final(self):
        return any(k.is_final for k in self.states)
    
    def EC(self):
        # epsilon closure of this node's NFA-state subset, as a new DFANode
        nodes = set()
        for nfaNode in self.states:
            nodes.add(nfaNode)
            for tok, next in nfaNode.edges:
                if tok == EPS:
                    nodes.update(next.EC())
        dfaNode = DFANode()
        dfaNode.states = nodes
        return dfaNode

    def M(self, queryTok):
        # the 'move' set: NFA states reachable from this subset on queryTok
        nodes = set()
        for nfaNode in self.states:
            for tok, next in nfaNode.edges:
                if tok == queryTok:
                    nodes.add(next)
        dfaNode = DFANode()
        dfaNode.states = nodes
        return dfaNode


class NFA(object):
    def __init__(self, start):
        self.start = start

    def get_toks(self):
        # collect every non-epsilon token by walking the whole NFA graph
        toks, seen, stack = set(), {self.start}, [self.start]
        while stack:
            node = stack.pop()
            for tok, next in node.edges:
                toks.add(tok)
                if next not in seen:
                    seen.add(next)
                    stack.append(next)
        return toks.difference({EPS})
    
    def accepts(self, query):
        """
        >>> n1, n2, n3 = NFANode(True), NFANode(False), NFANode(False)
        >>> n1.add('b', n2)
        >>> n1.add(EPS, n3)
        >>> n2.add('a', n2)
        >>> n2.add('ab', n3) 
        >>> n3.add('a', n1)
        >>> nfa = NFA(n1)
        >>> assert nfa.accepts('aaa')
        >>> assert not nfa.accepts('bb')
        >>> assert nfa.accepts('abababababababababbababababa')
        >>> assert nfa.accepts('baa')
        >>> assert nfa.accepts('baaaaaa')
        >>> assert nfa.accepts('aa')
        >>> assert nfa.accepts('baba')
        >>> assert not nfa.accepts('baababaaaaaaaaaaaaaaaaaab')
        >>> assert nfa.accepts('baababaaaaaaaaaaaaaaaaaaba')
        """
        q = [(0, self.start)]
        seen = set()
        
        while q:            
            pos, node = q.pop(0)
            if (pos, node) in seen:
                continue
            seen.add((pos, node))

            if pos == len(query):
                if node.is_final:
                    return True
            else:
                for (tok, next) in node.edges:
                    # epsilon edges don't consume a character of the query
                    if tok == EPS or tok == query[pos]:
                        q.append((pos + (tok != EPS), next))
        return False


class DFA(NFA):
    def __init__(self, nfaToConvert):
        self.nfa = nfaToConvert
        self.all_toks = self.nfa.get_toks()
        self.start = DFANode()
        self.start.states.add(self.nfa.start)
        # canonical DFANode for each frozenset of NFA states
        self.cache = {}
        self._convert()

    def _convert(self):
        # subset construction: BFS over the metagraph whose nodes are
        # epsilon-closed subsets of NFA states
        self.start = self.start.EC()
        self.cache[frozenset(self.start.states)] = self.start

        q = [self.start]
        seen = {frozenset(self.start.states)}

        while q:
            states = q.pop(0)
            for tok in self.all_toks:
                newDFANode = self.retrieve(states.M(tok).EC())
                if newDFANode.states:
                    states.edges.append((tok, newDFANode))
                    if frozenset(newDFANode.states) not in seen:
                        seen.add(frozenset(newDFANode.states))
                        q.append(newDFANode)

    def retrieve(self, node):
        # canonicalise: exactly one DFANode per distinct state subset,
        # keyed by frozenset so that equal subsets always hash equally
        key = frozenset(node.states)
        if key not in self.cache:
            self.cache[key] = node
        return self.cache[key]

    def accepts(self, query):
        """
        >>> n1, n2, n3 = NFANode(True), NFANode(False), NFANode(False)
        >>> n1.add('b', n2)
        >>> n1.add(EPS, n3)
        >>> n2.add('a', n2)
        >>> n2.add('ab', n3) 
        >>> n3.add('a', n1)
        >>> nfa = NFA(n1)
        >>> dfa = DFA(nfa)
        >>> assert dfa.accepts('aaa')
        >>> assert not dfa.accepts('bb')
        >>> assert dfa.accepts('abababababababababbababababa')
        >>> assert dfa.accepts('baa')
        >>> assert dfa.accepts('baaaaaa')
        >>> assert dfa.accepts('aa')
        >>> assert dfa.accepts('baba')
        >>> assert not dfa.accepts('baababaaaaaaaaaaaaaaaaaab')
        >>> assert dfa.accepts('baababaaaaaaaaaaaaaaaaaaba')
        """
        q = [(0, self.start)]
        seen = set()

        while q:            
            pos, node = q.pop(0)
            if (pos, node) in seen:
                continue
            seen.add((pos, node))

            if pos == len(query):
                if node.is_final():
                    return True
            else:
                for (tok, next) in node.edges:
                    if tok == query[pos]:
                        q.append((pos + 1, next))  # DFA edges never carry EPS
        return False


if __name__ == '__main__':
    import doctest
    doctest.testmod()