Error reporting during parsing 2005-04-04


While working on my assembler/vm for parsing one of the main problems I'd overlooked (which is silly, considering how many parsers I've written to handle this problem) was error reporting. As it turns out, this is a fairly straight forward problem.

I've written parser generators in the past that generated a high level language parser from BNF with some extra annotation (my first attempt was written in AmigaE about 10-12 years ago). The earliest approach I used was a way of marking where to stop backtracking.

A recursive descent parser with limited lookahead have well defined points where you will have an error condition: anywhere where you exceed the lookahead, backtracking past the lookahead threshold means you've encountered a parsing error.

However one of the things I don't want to do is limiting a user to a specific amount of lookahead. At the same time I want it to be easy to limit it.

The solution is to introduce an operator that marks a "point of no return": Once reached, attempts to backtrack past it will trigger an error.

Incidentally, this is an idea I got from Prolog. Prolog is a logic programming language that depends on evaluating rules, possibly a significant tree of possibilities, and backtrack if a path fails. To cut down processing, and direct the search for a solution, it has something called the "cut operator". The cut operator "prunes" the search tree by marking a similar type of point of no return. The cut operator tells Prolog that if it has reached that point, it shouldn't consider further rules.

A recursive descent parser is a, usually hard coded, decision tree, and using a cut operator saves you from evaluating all possible alternatives when you know they can't possibly work. A typical example is a parser that accepts only numbers and alphabetic names - once you've matched the first character you KNOW the other rule must fail, so it makes no sense to allow the parser to try to evaluate it. In that simple case it doesn't make much difference, but if you have hundreds of rules, it can really make a significant difference.

More than that, however, it makes it easy to flag what went wrong - by attaching an error code or a string to my cut operator, I have a simple mechanism for flagging a human readable error.

I'm letting the VM track column and line numbers, and I am considering also setting a default error string for each label jumped to, so that it becomes simple to automatically generate reasonable standard error messages whenever I insert my cut operator.

So, is my number of operators blowing up? Not really. Over the original instruction set I've added a NOP (no operation) operator, a BLT (Branch on Less Than) and BGT (Branch on Greater Than) operators. I've also added a range comparison operator, but I'm of two minds about it - it can easily be implemented in the assembler only and be made to emit multiple opcodes. That's the approach I've taken with quite a few other operators I've added to the assembler - including "KLN" for "kleene star" (matching a production 0 or more times).

For the most part I think I will try to keep the VM as minimal as possible - at least until I have some profiling data for some test parsers to use for determining what operators truly deserve to be optimized.

I spent a sizable chunk of the weekend (when I should have been reading for my exam) cleaning up the code, and it's getting time to post it, I think, even though the opcodes and assembler syntax will remain unstable for some time. Stay tuned!


blog comments powered by Disqus