Simple push parsers

I've been toying with a simple table-driven push parser class today. Normally I write my parsers as recursive descent, either with or without a separate lexer stage.

However, I've always disliked pull parsers because they're inflexible: the parser, and not you, controls the amount of IO. As such, a pull parser easily forces you towards multithreading even when you could otherwise have multiplexed the application logic.

A push parser, by contrast, needs to work only on the input fed to it. A common way of doing that is a nondeterministic or deterministic finite automaton, or a similar technique such as a pushdown automaton, all of which can easily be designed to work with single-character inputs.
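To make the "single-character input" idea concrete, here is a minimal sketch of a deterministic finite automaton driven by pushes. The grammar (unsigned decimal numbers such as 12 or 3.14) and every name in it are assumptions picked for illustration only.

```python
# Hypothetical example: a DFA for unsigned decimals, fed one character at a time.

def char_class(ch):
    """Map a character to the input class used by the transition table."""
    if ch in "0123456789":
        return "digit"
    if ch == ".":
        return "dot"
    return "other"


# (state, input class) -> next state; any missing pair is an error.
TRANSITIONS = {
    ("start", "digit"): "int",
    ("int", "digit"): "int",
    ("int", "dot"): "dot",
    ("dot", "digit"): "frac",
    ("frac", "digit"): "frac",
}
ACCEPTING = {"int", "frac"}


class NumberDFA:
    def __init__(self):
        self.state = "start"

    def push(self, ch):
        """Feed one character; returns False once the input can no longer match."""
        self.state = TRANSITIONS.get((self.state, char_class(ch)), "error")
        return self.state != "error"

    def accepted(self):
        """True if the input seen so far is a complete number."""
        return self.state in ACCEPTING
```

The caller decides when and how much to feed: it can call push() once per character as data trickles in from a socket, and ask accepted() whenever it likes.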

However, I wanted a class that let me easily handwrite parts, so what I ended up with was the following:

A table-driven parser with one table per production. For each entry in each table I store a flag indicating whether it's optional, a pointer to another table, and a pointer to an "acceptor object".

The "acceptor" is just a small class that provides one method to check whether it will accept the current character, and another to check whether it has reached the end. It lets me customize behaviour easily, and dramatically cuts down on states by letting me define generic constructs such as "recognise this string".
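A minimal sketch of what such acceptors might look like in Python. The class and method names (StringAcceptor, OneOf, accept, done) are assumptions made for illustration, not the names used in the original class.

```python
class StringAcceptor:
    """Illustrative "recognise this string" acceptor, fed one character at a time."""

    def __init__(self, text):
        self.text = text
        self.pos = 0  # how many characters of text have been matched so far

    def accept(self, ch):
        """Consume ch if it is the next expected character; report whether it was taken."""
        if self.pos < len(self.text) and self.text[self.pos] == ch:
            self.pos += 1
            return True
        return False

    def done(self):
        """True once the whole string has been matched."""
        return self.pos == len(self.text)


class OneOf:
    """Illustrative acceptor for exactly one character out of a set."""

    def __init__(self, chars):
        self.chars = set(chars)
        self.matched = False

    def accept(self, ch):
        if not self.matched and ch in self.chars:
            self.matched = True
            return True
        return False

    def done(self):
        return self.matched
```

A single StringAcceptor("while") stands in for what would otherwise be five separate states in a plain state table, which is where the reduction in states comes from.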

A simple parser class pushes states onto a stack until it reaches the first state with no pointer to another production. Once an acceptor is "done", the parser moves to the next entry in the topmost table. Once it reaches the end of that table, it pops the state and skips to the next entry in the new topmost table. It continues until the stack is empty.
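The loop described above can be sketched roughly as follows. This is a simplified reconstruction under stated assumptions, not the original class: all names are hypothetical, acceptors are fixed non-empty strings, and the optional flag is only honoured on entries that hold an acceptor.

```python
from dataclasses import dataclass
from typing import Optional


class Literal:
    """Illustrative acceptor: matches one fixed, non-empty string."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def reset(self):
        self.pos = 0

    def accept(self, ch):
        if self.pos < len(self.text) and self.text[self.pos] == ch:
            self.pos += 1
            return True
        return False

    def done(self):
        return self.pos == len(self.text)


@dataclass
class Entry:
    acceptor: Optional[Literal] = None  # leaf entry: matches characters directly
    table: Optional[list] = None        # or a pointer to another production's table
    optional: bool = False              # assumption: only honoured on leaf entries


class PushParser:
    def __init__(self, start_table):
        self.stack = [[start_table, 0]]  # frames of [table, index of current entry]
        self.consumed = False            # has the current leaf taken any input yet?
        self.failed = False
        self._settle()

    def _settle(self):
        """Pop exhausted tables and push sub-productions until the top entry is a leaf."""
        while self.stack:
            table, index = self.stack[-1]
            if index == len(table):
                self.stack.pop()
                if self.stack:
                    self.stack[-1][1] += 1  # step past the entry that pointed here
            elif table[index].table is not None:
                self.stack.append([table[index].table, 0])
            else:
                table[index].acceptor.reset()
                self.consumed = False
                return

    def push(self, ch):
        """Feed one character; returns False once the input cannot match."""
        while self.stack and not self.failed:
            frame = self.stack[-1]
            entry = frame[0][frame[1]]
            if entry.acceptor.accept(ch):
                self.consumed = True
                if entry.acceptor.done():
                    frame[1] += 1
                    self._settle()
                return True
            if entry.optional and not self.consumed:
                frame[1] += 1  # skip the optional entry and retry the same character
                self._settle()
            else:
                self.failed = True
        self.failed = True  # rejected character, or input after the grammar ended
        return False

    def finished(self):
        return not self.stack and not self.failed


# Hypothetical grammar: greeting := "hello" [","] " " name ; name := "world"
name = [Entry(acceptor=Literal("world"))]
greeting = [
    Entry(acceptor=Literal("hello")),
    Entry(acceptor=Literal(","), optional=True),
    Entry(acceptor=Literal(" ")),
    Entry(table=name),
]


def parse(text):
    parser = PushParser(greeting)
    return all(parser.push(ch) for ch in text) and parser.finished()
```

Note that push() returns after every character, so the caller keeps control of IO. A fuller version would also need acceptors of variable length (where "done" is only known when a character is rejected) and optional sub-productions, both of which this sketch leaves out.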

This is not to be confused with a pushdown automaton, where the stack is used to store symbols that have been parsed, not the history of states.

Actually, this is more or less recursive descent turned inside out. Imagine writing a recursive descent parser in a language that supports coroutines: instead of reading a character, the parser always yields and doesn't regain control until a new character is available. Here the same thing is made explicit by returning to the caller and retaining an explicitly managed stack.
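For comparison, here is a rough sketch of the coroutine variant using Python generators, where the language keeps the stack instead of the parser. The grammar and all names are assumptions chosen for illustration.

```python
def literal(text):
    """Sub-parser: expect each character of text in turn."""
    for expected in text:
        ch = yield  # suspend here until the caller pushes a character
        if ch != expected:
            raise ValueError(f"expected {expected!r}, got {ch!r}")


def greeting():
    """Hypothetical rule: greeting := "hello " "world"."""
    yield from literal("hello ")
    yield from literal("world")


class CoroutineParser:
    def __init__(self, rule):
        self.gen = rule()
        self.finished = False
        # Run to the first yield; assumes the rule reads at least one character.
        next(self.gen)

    def push(self, ch):
        """Feed one character; raises ValueError on a mismatch or trailing input."""
        if self.finished:
            raise ValueError("unexpected input after end of rule")
        try:
            self.gen.send(ch)
        except StopIteration:
            self.finished = True  # the outermost rule returned
```

The rules read like ordinary recursive descent, with each nested yield from playing the role of a pushed table. The table-driven version does the same bookkeeping by hand, which is what makes it usable in languages without coroutines.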

I'm sure this isn't an original technique - it's too simple - but I can't remember if I've seen it described anywhere. If anyone recognises it from elsewhere, let me know, as I'm always interested in finding out if I've missed any obvious optimizations.

Simple push parsers 2005-03-22