Tuesday, September 28, 2010

Finding All Clauses of a Context-Free Grammar

This question was given in week 3 of the 2009 National Computer Science School (NCSS) Challenge. It remains one of my favourite competition problems.


The task is simple. Given a description of a context-free grammar (CFG) in Backus-Naur Form, generate all productions (clauses) of the CFG. Assume the grammar is not recursive (or else there are an infinite amount of clauses).


The solution revolves around the following two observations:
  • If <a> ::= <b> <c> the productions of <a> are the Cartesian product of <b> and <c>
  • If <a> ::= <b> | <c> the productions of <a> are the concatenation of <b> and <c>
For the record, | has the lowest precedence. That means that if <a> | <b> <c> | <d>, the solution is the concatenation of <a>, (<b> x <c>), <d>


The solution naturally lends itself to two parts. We need to parse the grammar into a 
format we can deal with, then we need to generate the productions.



Parsing



BNF is in a key:value format, so it makes sense to use a dictionary. We can use a nested list format for each definition. A list will contain sublists which were separated by | . It's sorta hard to explain so here's an example:


<word> ::= <a> <b> | <c> <d> | <e> "lol i troll you"


Will become

mapping['<word>'] = [['<a>', '<b>'], ['<c>', '<d>'], ['<e>', '"lol i troll you"']]

Note the fact that an entire string in BNF -- including the surrounding quotation marks. We're gonna use some very simple regular expressions to help us identify <these> and "these".

import re

def parse(grammar):
    mapping = {}

    for definition in grammar.strip().splitlines():
        word, meaning = definition.split('::=') # separate key and value                                                                     
        superList = meaning.split('|')
        newSuperList = []

        for ss in superList:
            newSuperList.append(re.findall(r'(<.*?>|".*?")', ss))

        mapping[word.strip()] = newSuperList

    return mapping


Of course, I'm interested in exciting code, even if I have to be a bit ugly about it.

import re
def parse(g):
    return dict([(w.strip(), [re.findall(r'(<.*?>|".*?")', s) for s in m.split('|')]) for w, m in [d.split('::=') for d in g.strip().splitlines()]])

That's better. Where's the fun in readable code?



Generating



Now this is the fun part. We're gonna solve this problem recursively, simply because this problem is a perfect example of one which at first appears complex but can be defined by a set of simple recursive rules. The two recursive rules are in the introduction above. We must also add our break case: we stop recursing whenever we reach a literal (a string in the BNF).

Our recursive function will take in a single parameter: the current term that we're generating clauses from. It will return a list of strings (technically it will be an iterator of strings), where each string is a possible clause in the CFG. If the term parameter is a literal (it matches the pattern ".*") then we've hit the break case and we return a list containing just that term. Otherwise, we expand the term and recurse deeper by applying our recursive rules.


This is a cool challenge, so I'm going to leave you with two options: either figure out how to program that or try to understand my one-liner here. Have fun!

import itertools
def generate(term):
    return re.findall(r'"(.*?)"', term) if re.match('".*', term) else itertools.chain(*[map(''.join, itertools.product(*map(generate, p))) for p in syntax[term]])

Finally, here is the whole code with a CFG provided as an example. It's a very very simplified English sentence CFG. The code works perfectly and because we use itertools.chain() the returned object will be an iterator.


from itertools import chain, product
from re import match, findall

GRAMMAR = '''                                                                                                                                                                         
<sentence> ::= <noun phrase=""> <verb phrase="">                                                                                                                                            
<noun> ::= "boy " | "troll " | "moon " | "telescope "                                                                                                                                 
<transitive verb=""> ::= "hits " | "sees "                                                                                                                                               
<intransitive verb=""> ::= "runs " | "sleeps "                                                                                                                                           
<adjective> ::= "big " | "red "                                                                                                                                                       
<adverb> ::= "quickly " | "quietly "                                                                                                                                                  
<determiner> ::= "a " | "that " | "each " | "every "                                                                                                                                  
<pronoun> ::= "he " | "she " | "it "                                                                                                                                                  
<noun phrase=""> ::= <determiner> <noun> | <determiner> <adjective> <noun>                                                                                                               
<verb phrase=""> ::= <intransitive verb=""> | <transitive verb=""> <noun phrase="">                                                                                                               
'''

def parse(g):
    return dict([(w.strip(), [findall(r'(<.+?>|".+?")', s) for s in m.split('|')]) for w, m in [d.split('::=') for d in g.strip().splitlines()]])

def generate(term):
    return findall(r'"(.*?)"', term) if match('".*', term) else chain(*[map(''.join, product(*map(generate, p))) for p in syntax[term]])

syntax = parse(GRAMMAR)
print list(generate('<sentence>'))

EDIT: Alright, that was a bit mean of me. Here's an expanded version of the generate() function. Whenever I'm obfuscating I always write it up in an expanded version, get it to work, then compress it. In my idiocy, I deleted my expanded 'working' version, so I'm writing this one up on the spot. Note that this one returns a list instead of an iterator.

def generate(term):
    if match(r'".*"', term):
        # if the term is an ebnf string, return it as a single-item list
        #  but strip off the quotation marks
        return findall(r'"(.*?)"', term)
    else:
        # otherwise, solve this subproblem and return a list containing all 
        solutions = []  # this will contain the function's output
        for toConcatenate in syntax[term]:
            expansions = [generate(toExpand) for toExpand in toConcatenate]
            for subsolutions in product(*expansions):
                solutions.append(''.join(subsolutions))
        return solutions

No comments:

Post a Comment