As I wrote previously I'm experimenting with an 'assembly language' for parsing.
I'm about halfway through writing the parser for the assembler in itself and temporarily hard wiring it into the virtual machine to bootstrap it.
Lessons learned so far include: I really DO want a couple more high level instructions. I've only add BLT and BGT (Branch on Less Than and Branch on Greater Than) to the instruction set so far to make handling ranges easier, but I realise that I really want to create expanded versions of CMP, TRY and REQ (see the earlier entry for descriptions) that will handle ranges instead of single values, as it will dramatically simplify some rules.
The kleene star and one-or-more instructions I hinted at would also have proved very useful, and will certainly be added. All in all I want to focus on two groups of instructions: High usability low level functions (i.e. they manipulate the state of the vm directly) and 100% "composable" functions - that is, instructions that can be composed entirely of low level functions. In my experience that makes implementation so much easier, and maintaining that separation means that you get good coverage of low level instructions so that any high level functionality you leave out is likely to be composable.
Note that I'm NOT aiming for turing completeness, though it wouldn't be strange if it happens by chance. The goal is a very restricted language only intended for parsing.
I've thought quite a bit about it over the weekend, and in particular whether or not it is likely to be particularly useful, considering the availability of similar tools.
What I've realised, though, is that given the complexity of parsers, it's very rare for parser generators to meet my needs, and it's very rare for me to be happy with generated parsers without manual modification. However modifying lex and yacc parsers manually is something you'd probably not want to try. Modifying an assembly like language, however, is fairly straightforward (that's the old 6510 and M68000 demo programmer in me speaking). I've actually written a whole compiler in M68k assembly before, and so this is bringing back memories - M68k assembly was actually quite well suited at least for the parsing aspect.
The advantage of a heavily restricted VM where programs will mostly use relatively high level constructs is that it will be easy to make reasonably well performing interpreter for it. Given the current size (ca. 300 lines of C++) it is reasonable to assume that a competent programmer could port it to any language in a day or to, and instantly get access to working versions of any grammar translated into the language. This is one of the areas where most parser generators become a burden.
Another thing I want to try is building a relatively sophisticated BNF -> parser "compiler". Most of the constructs I've added are very well suited for translating to from BNF, as I'm used to using BNF as a starting point for all parsers I write. There are quite a few things such a compiler could do very easily with the instruction set I'm working on to generate a much faster parser than what I'd write by hand (because if I write by hand I'd be aiming for simplicity...):
- inlining productions.
- "lifting" shared or "almost shared" initial terms in OR groups. If you write a grammar where a bunch of productions all start with a character, and use them in an or expression like this: (foo | bar | baz) it makes sense for the compiler to generate a single check for a range of characters to allow. This is fairly trivial to add.
- merging of similar subtrees. That is, if I have productions in an or-expression that expect the same initial characters it can often be fairly easy to check for the initial characters separately.
All of this serves to bring the resulting assembly closer to expressing an NFA for the language, but just expressing it in bytecode instead of a table of transitions. The advantage is that it'll be fairly easy to make it possible to turn these optimisations off to get readable assembly to look at to debug the parser.
Another advantage I see from this approach is exactly the debugability - I already have a "tracing mode" for my VM that will output the exact instruction stream as executed, and it makes it so much easier to diagnose problems.
The last advantage I see is size. Expat (a C XML SAX parser) on my system is about 126k. The full VM executable (meaning it's also dragged in a lot of library code) plus most of the assembler, and debug output and without optimization is currently 38K. A C version would probably weigh in at significantly less. Clean it up, and add bytecode for an XML parser - given my experience with it so far I think it would weigh in at about 5-6KB and conversion functions for UTF-8/UTF-16 and latin-1 I think it should be easy to get a full XML parser into less than 40K. Quite possibly less than 30... Looking forward to trying.