OK, are you ready to get Properly Object-Oriented? The new version of the monadic parser combinators from parts 1 and 2 of this series will have inheritance, polymorphism and information hiding like nobody’s business. Let’s start off with the Abstract Base Class*, Parser.
*Abstract Base Class not actually used. This post uses Python 2.5. This post offers no guarantees of suitability or completeness; all holes in post left as reader exercises.
class Parser():
def parse(self, s, i):
raise NotImplementedError()
def run(self, s):
(s,i), answer = self.parse(s, 0)
if i==len(s) and not isinstance(answer, Exception):
return answer
else:
raise answer
def runsexp(s):
return Bind(Chain(SkipSpace(), Sexp()),
lambda answer: Chain(SkipSpace(), Return(answer))).run(s)
A parser is now a class with a parse method that takes a string and an index and returns a tuple of (string,index) along with a value. This regularises the usage and construction of parsers a bit, as you can see in runsexp as a bit of a preview. run and runsexp are now properly factored into two pieces; run will run any parser, while runsexp runs our particular S-expression grammar on a parser.
Here are the simple parser combinators. They’re mostly the same, but are now OO. They inherit from Parser and implement the polymorphic parse method. And uh, they don’t really hide information. But you will notice that the inner functions named parse from last time have now become Parser.parse.
class Atom(Parser):
def parse(self, s, i):
if i >= len(s): return Fail("Ran off end of string").parse(s,i)
start = i
while i < len (s) and not s[i].isspace() and not s[i]==')' and not s[i]=='(':
i += 1
if i==start:
return Fail("Atom not found at %d; '%s' found instead." % (i, s[i])).parse(s,i)
else:
return (s,i), s[start:i]
class SkipSpace(Parser):
def parse(self, s, i):
while i < len(s) and s[i].isspace():
i += 1
return (s,i), ()
class Char(Parser):
def __init__(self, c):
self.c = c
def parse(self, s, i):
if i >= len(s): return Fail("Ran off end of string").parse(s, i)
if s[i]==self.c:
return (s,i+1), self.c
else:
return Fail("'%s' expected at %d, but '%s' found instead. [%s]" %
(self.c,i,s[i],s[max(0,i-5):min(len(s), i+5)])).parse(s, i)
Here are the monadic parser combinators:
class Chain(Parser):
def __init__(self, parser1, parser2):
self.parser1 = parser1
self.parser2 = parser2
def parse(self, s, i):
(s,i), a = self.parser1.parse(s, i)
if isinstance(a, Exception): return (s,i), a
else: return self.parser2.parse(s, i)
class Bind(Parser):
def __init__(self, parser1, f):
self.parser1 = parser1
self.f = f
def parse(self, s, i):
(s,i), a = self.parser1.parse(s, i)
if isinstance(a, Exception): return (s,i), a
else: return self.f(a).parse(s, i)
class Return(Parser):
def __init__(self, x):
self.x = x
def parse(self, s, i):
return (s,i), self.x
class Fail(Parser):
def __init__(self, msg):
self.msg = msg
def parse(self, s, i):
return (s,i), Exception(self.msg)
class Or(Parser):
def __init__(self, *alts):
self.alts = alts
def parse(self, s, i):
for alt in self.alts:
(snoo,inew), a = alt.parse(s, i)
if not isinstance(a, Exception):
return (snoo,inew), a
return (snoo,inew), a # (s,i), a
This is all very nice, but I would like to point out that it’s much more verbose than the functional style. Unless you are an OO zealot, the only two reasons to make this change would be to (1) help fellow programmers who can’t read functional style yet or (2) take advantage of operator overloading, which is only available to classes in Python.
In fact, changing the parser superclass to the following will let us use nicer syntax. We just need to add a few magic methods to the base class:
class Parser():
def parse(self, s, i):
raise NotImplementedError()
def run(self, s):
(s,i), answer = self.parse(s, 0)
if i==len(s) and not isinstance(answer, Exception):
return answer
else:
raise answer
def __rshift__(parser1, parser2):
return Chain(parser1, parser2)
def __getitem__(parser, f):
return Bind(parser, f)
def __or__(parser1, parser2):
return Or(parser1, parser2)
This change allows us to write Sexp and SexpList like this:
class Sexp(Parser):
def parse(self, s, i):
return ((Char('(') >> SexpList())[lambda sexps:
SkipSpace() >> Char(')') >> Return((sexps[0][0], sexps[1:]))]
| Atom()[lambda a: Return((a, []))]
| Fail("Couldn't find closing parenthesis")).parse(s, i)
class SexpList(Parser):
def parse(self, s, i):
return ((SkipSpace() >> Sexp())[lambda sexp1:
SexpList()[lambda rest: Return([sexp1]+rest)]]
| Return([])).parse(s, i)
I’m not sure how much I like the bind syntax—I just chose something that would let me leave off parentheses around lambda—but the other two operators are clear winners, I think. Writing SexpList like this also shows that the recursive definition is now simpler than the manual implementation with a
while-loop.
For the final entry in this series, I’ll look at the performance of the three variants of this style. I may also look at the performance of a hand-rolled state machine, a recursive descent parser or a yacc-generated parser. But I really doubt it.
In part 1, I showed a fairly annoying but nicely regular method of parsing. Today I will show how a lot of the annoyingness can be hidden with a few driver functions. These functions will allow us to succinctly combine the simple parsers like char, skipspace and atom into complex parsers.
The functions are monadic, which means that they follow a well-known design pattern from functional programming. It’s not really important here because (1) I don’t generalise beyond parsing and (2) Python’s syntax makes it hard to make them convenient to use. But that’s why I chose some odd names like ‘ret’, short for return. Here’s the code:
def chain(parser1, parser2):
def chainedparse(s, i):
(s,i), a = parser1(s, i)
if isinstance(a, Exception): return (s,i), a
return parser2(s, i)
return chainedparse
def bind(parser, f):
def parse(s,i):
(s,i), a = parser(s,i)
if isinstance(a, Exception): return (s,i), a
return f(a)(s,i)
return parse
def ret(x):
def parse(s,i):
return (s,i), x
return parse
def fail(msg):
def parse(s,i):
return (s,i), Exception(msg)
return parse
The easiest to explain is chain, which runs two parsers but only returns the result of the second parser. This is useful for parsers like skipspace, where you can write (s,i), a = chain(skipspace, atom)(s,i) and avoid creating a bunch of variables named _. You can see that chain implements a single step of the boilerplate as well as the error checking that I was too lazy to put in by hand before. Before, code like sexp was a long chain of
: (s,i), sexps = sexplist(s, i) (s,i), _ = skipspace(s, i) :
Bind is similar to chain, but the result of the first parser is passed to the second parser so it can do something useful with it. Actually, you have to wrap the second parser in a single-argument function. The parser interface doesn’t have any additional space to receive the input. But bind takes care of basically the same boilerplate as chain does. Note that if the first parser returns an exception, the second parser is never run, so an Exception short-circuits evaluation like you’d expect.
Ret takes care of the case when you want to return a value from a parser without having to add in the (s,i) boilerplate. It’s particularly useful in conjuction with bind: you can say something like bind(atom, lambda a: ret((a, []))), which gives you a parser that parses an atom and wraps it in a tuple.
Fail does basically the same thing as ret, except that it returns an exception with a message. You’ll also notice that all of these functions immediately return a nested function. According to my current model, the inner function is the parser. For example, in chain, the inner function is a parser that chains together the two parsers passed to the outer function. The chained parser can then be used just like any other function that implements the parser interface. This is the combinator part of the “monadic parser combinators". Besides the monadic combinators, there are other useful ones, like or (called alt below):
def alt(*alts):
def parse(s,i):
for alt in alts:
(snoo, inew), a = alt(s,i)
if not isinstance(a, Exception):
return (snoo,inew), a
return (s,i), a # maybe return (snoo,inew), a ?
return parse
Alt tries each parser in turn and returns the first one that succeeds. If none succeed, it returns the exception from the last alternative.
Now let’s look at how the code for complex parsers changes with these combinators, sexp and sexplist:
def sexp(s, i):
(s,i), c = char('(')(s, i)
if c=='(':
return bind(sexplist, lambda sexps:
chain(skipspace,
chain(char(')'),
ret((sexps[0][0], sexps[1:])))))(s, i)
else: # c isinstance Exception
return bind(atom, lambda a: ret((a, [])))(s, i)
def sexplist(s, i):
(s,i), l = chain(skipspace, sexp)(s, i)
if isinstance(l, Exception):
return ret([])(s,i)
else:
return bind(sexplist, lambda rest:ret([l]+rest))(s, i)
(run and the simple parsers change very little, except to use fail, so I’ll not repeat them here.)
Actually, I didn’t use alt there. If Python supported recursive definitions or lazy definitions, alt would allow us to write point-free combinators:
sexp = alt(bind(chain(char('('), sexplist), lambda sexps:
chain(skipspace, chain(char(')'), ret((sexps[0][0], sexps[1:]))))),
bind(atom, lambda a: ret((a, []))),
fail("Couldn't find closing parenthesis"))
sexplist = alt(bind(chain(skipspace, sexp), lambda sexp1:
bind(sexplist, lambda rest:
ret([sexp1]+rest))),
ret([]))
A point-free combinator is one with no arguments. In other words, since alt creates a function that conforms to the parser interface, you don’t need to explicitly create your own parser function. However, Python doesn’t support recursive definitions because of its strict top-to-bottom evaluation. So we have to wrap the
alt-parser versions in a function definition in order to manually delay evaluation:
def sexp(s,i):
return alt(bind(chain(char('('), sexplist), lambda sexps:
chain(skipspace, chain(char(')'), ret((sexps[0][0], sexps[1:]))))),
bind(atom, lambda a: ret((a, []))),
fail("Couldn't find closing parenthesis"))(s,i)
def sexplist(s,i):
return alt(bind(chain(skipspace, sexp), lambda sexp1:
bind(sexplist, lambda rest:
ret([sexp1]+rest))),
ret([]))(s,i)
This is about the end of the road for semantic elimination of boilerplate. From here on out, it’s all syntactic tricks to make the highly nested style above more palatable; I’m no longer naive enough to think that most people will want the 4-line point-free version if one of the lines ends with 5 (five!) closing parentheses. That’s what I’ll discuss next time.
Every time I tell myself I should lighten up and learn to enjoy Python’s lightweight imperative implementation of laziness, a non-local-dependency bug bites me. Here’s the most recent one:
In a script, I changed the following fix-up code from
def fix(node):
return '-'.join(map(str.strip, reversed(''.join(node).split(','))))
to
def fix(node):
snode = ''.join(node)
if snode==')':
return 'RRB'
elif snode=='(':
return 'LRB'
else:
return '-'.join(map(str.strip, reversed(''.join(node).split(','))))
Can you see the bug? No? Well, I do a little work twice, joining a list of chars into a string: snode = ‘’.join[node]. I wasn’t sure that this ad-hoc checking for parentheses would work so I didn’t want to refactor the original expression, which also contains ‘’.join(node).
The problem is that node is not a list of chars, but a generator of chars. So it’s use-once. And I used it twice, but the second time, it was unfortunately empty. Oops. Even less fortunately, my program didn’t crash when fix returned an empty string for most inputs, so I didn’t know about this until my script had run for about 5 minutes.
To diagnose this bug, I had to jump up two functions to the main code, then back down two functions into another module. This code is quite lazy, with yield and generators all over the place, so it doesn’t actually read the individual nodes until much later, when fix is called. I thought laziness was the New Python 3.0 Way, but apparently the New Python Way is susceptible to hairy non-local bugs. This is the essence of laziness, to collapse all your nicely-separated code into one giant executing mess, but running into bugs based on the non-locality of the implementation means that the implementation has holes in its abstraction layer.
The use-case is: I create a generator somewhere, then return, pass it around a lot, and finally use the generator twice somewhere else entirely. Remember, this is Python, so I am not constantly reminding myself about the type of something, and, from previous versions of Python and other languages, I expect to be able to use variables more than once.
For example, C# prevents this “use-once” bug by wrapping an additional layer of abstraction around its enumerators. When you create a generator, you don’t get an Enumerator, you get an Enumerable. Using the Enumerable creates a new Enumerator each time, so my above program would have become twice as slow, but at least it wouldn’t have become incorrect.
This is Bad and Wrong and I would suspect that Python 3.4 or whatever will adopt the C# model, *except* that it’s such a glaring problem that somebody surely noticed it during development of 2.4-3.0. It could be that Guido doesn’t like the use-case I present here.
Well, after the depressing episode with iterators yesterday, it feels kind of weird to show the other end of s-expressions today: parsing. I normally would not write this in Python, but I need to parse s-expressions into trees so I can use an existing Python library to process it. Since I was sort of on vacation for the last couple of weeks, I spent way more time than necessary building monadic parser combinators to do the parsing.
As always, I feel obligated to link to a more serious implementation of monadic parser combinators in Python in case you really want to use them yourself.
This is a gratuitous explanation of monadic parsing in Python. I’m on vacation, so I decided to write my own s-expression parser instead of finding some code or using Python’s equivalent of yacc. Here’s the s-expression grammar I intended to use:
However, I didn’t come up with a good way to implement +, or to integrate regexes into my system. So here’s the grammar I actually used:
This should work fine for the sentences I need to parse, for example this one:
test2 ="" " (S (VVPS POPPHH) (VNDD HVPS) (VVIV VVSN) (NP (CONJP (NN PR) (IK++ PN) (NN PR) (IK++ NNDDSS) (NN PORP) (IK++ POOP) (NN VVPT) (RO PR)) (NN__SS PODP)) (NP (PN____GG NNDD) (NN IP))) """
The output of the parser will be a tuple tree, which in Haskell would have the following type:
data TupleTree a = TupleTree (a, [TupleTree a])
Since Python doesn’t support pattern matching, only having one constructor is a nice feature. Leaves are distinguished by having no children.
So anyway. I’m going to implement monadic parser combinators in Python; if you don’t know what those are, stick around—they are a lot simpler than they sound. The word ‘monadic’ means that the code uses a specific interface to do its work, one that is (1) purely functional (no side effects) and (2) easy to wrap up in a nice interface. The word ‘parser’ is obvious (I hope), and the word ‘combinator’ just means that simple parsers can be combined with operators like OR to produce complex parsers.
The simplest parser is skipspace. It just skips spaces without returning them.
def skipspace(s, i):
while i < len (s) and s[i].isspace():
i += 1
return (s, i), ()
As you can see, I have defined a parser as a function that takes a string and an index and returns the string, a new index, and a value. Since skipspaces doesn’t produce a value, it returns the empty tuple for its value. Of *course*, since this is Python, you could choose to model all this stuff as classes. I will eventually, but it adds a lot of boilerplate to the implementation. Unfortunately, this means that you’ll have to understand nested functions in the meantime.
def char(c):
def parse(s, i):
if i >= len(s):
return (s,i), Exception("Ran off end of string")
if s[i]==c:
return (s, i+1), c
else:
return (s,i), Exception("'%s' expected at %d, but '%s' was found instead. [%s]" %
(c,i,s[i], s[max(0, i-5):min(len(s), i+5)]))
return parse
def atom(s, i):
if i >= len(s): return (s,i), Exception("Ran off end of string")
start = i
while i < len (s) and not s[i].isspace() and not s[i]==')' and not s[i]=='(':
i += 1
if i==start:
return (s,i), Exception("Atom not found at %d; '%s' found instead." % (i, s[i]))
else:
return (s, i), s[start:i]
char and atom are a bit more complicated than skipspace because they return a value and also give debugging info. This fills in one more piece of the interface that I skipped earlier; if the value is an Exception, then it indicates that a parse error has occurred. Notice that the exception is returned, not raised. The driver code I show later will raise the exception.
So far the code I have shown is very low-level and not much like the grammar I want to copy. Let’s fix that with sexp and sexplist. They closely reflect the grammar specification:
def sexp(s, i):
(s,i), c = char('(')(s, i)
if c=='(':
(s,i), sexps = sexplist(s, i)
(s,i), _ = skipspace(s, i)
(s,i), _ = char(')')(s, i)
return (s,i), (sexps[0][0], sexps[1:])
else: # c isinstance Exception
(s,i), a = atom(s, i)
if isinstance(a, Exception):
return (s,i), a
else:
return (s,i), (a, [])
def sexplist(s, i):
(s,i), _ = skipspace(s, i)
(s,i), sexp1 = sexp(s, i)
if isinstance(sexp1, Exception):
return (s,i), []
else:
(s,i), rest = sexplist(s, i)
if isinstance(rest, Exception):
return (s,i), rest
else:
return (s,i), [sexp1]+rest
So let’s see…sexp checks for an open parenthesis and if it finds one, looks for a sexplist and then a close parenthesis. Otherwise, it returns an atom. Either way, the value returned is a pair of atom and its children–but a leaf has no children. sexplist is similar, except that it looks for an initial sexp and returns [] if it can’t find it. (Note: I wrote sexplist recursively, but a while-loop version with a list accumulator would probably be cleaner in Python.)
def run(s):
(s,i), _ = skipspace(s, 0)
(s,i), answer = sexp(s, i)
(s,i), _ = skipspace(s, i)
if i==len(s) and not isinstance(answer, Exception):
return answer
else:
raise answer
There are a few things to notice about this style. First, modulo the exception-checking, the code adheres pretty closely to the grammar specification. There’s a lot of boilerplate and calls to skipspace, but the basic structure is similar. That’s the advantage of this style. Second, there’s a lot of boilerplate on each line as the string and index are updated. Even when you don’t care about the value returned by a parser, you have to call it something (I use _). Also, there’s a lot of checking for Exception to see if the parse has failed. In fact, it should really happen after every parser call, I was just too lazy to put all that checking in.
Fortunately, both the assignment and error-checking boilerplate is extremely regular and easily factored out, leaving the grammar-like advantages intact. This is where the monad functions come into play. I’ll define and use those next time.
For now, here are some examples of the parsers:
>>> run('foo')
('foo', [])
>>> ('atom', [('foo', []), ('bar', [])])==run('(atom foo bar)')
True
>>> run(test2)
('S', [('VVPS', [('POPPHH', [])]), ('VNDD', [('HVPS', [])]), ('VVIV', [('VVSN', [])]), ('NP', [('CONJP', [('NN', [('PR', [])]),
('IK++', [('PN', [])]), ('NN', [('PR', [])]), ('IK++', [('NNDDSS', [])]), ('NN', [('PORP', [])]), ('IK++', [('POOP', [])]),
('NN', [('VVPT', [])]), ('RO', [('PR', [])])]), ('NN__SS', [('PODP', [])])]), ('NP', [('PN____GG', [('NNDD', [])]),
('NN', [('IP', [])])])])
At least it was at my school. Or maybe it was Calvin’s school.
Imagine a backward world where
:: Next Page >>