Turtle progress and parser assembler

After my success with the BNF generator yesterday, I just had to redo my Turtle parser based on the BNF in the Turtle spec. Thanks to the clarity of the grammar I rewrote the parser in less than an hour. Though "rewrote" is exaggerating - I cut and pasted the BNF from the spec and massaged it a bit (for one, I've realised that I messed up the priority of the "or" operator so I need to paranthesise bits - I'll have to fix that) and added the appropriate triggers and start adding error handling.

It seems to pass all the tests referenced in the spec now, though I've taken one or two shortcuts with character set handling that needs to be fixed.

However doing the Turtle parser now and the XML parser last night made me think of one important improvement to make, though it will cause the BNF generator to grow quite a bit:

A lot of the error handling can be added automatically. The key is that an error in a recursive descent parser should generally be added in the first position you know you don't have any further valid paths. In other words, you "only" need to create a graph and do some basic path analysis to figure out that in Turtle for instance, if you've successfully swallowed "@prefix" no other rules can handle the input if you were to backtrack back to the "@".

As a result, you should give an error message immediately if the next step after "@prefix" doesn't match expectation, or if the remainder of the rule as a whole fails.

At the same this this analysis allows significant optimisations of the code: By factoring out the common parts of rules, they can be combined and the code paths merged.

This is essentially what a good NFA/DFA generator will do, and when you look at it, the parser assembler isn't that far from a DFA converted into the steps to execute. The fact that it's human readable, portable and easy to customise or hand write code for is the main advantage as I see it, not the functionality of the VM itself.

As it stands, though, the BNF generator generated Turtle grammar takes only 30 more bytes than the hand written one that I've just retired, and that will have to do for now while I experiment some more (the bug list for the BNF generator and parser assembler is already getting quite long, and it will get longer, but so far all of them have reasonable workarounds so I'm just recording them for now).

All in all, the Turtle parser C++ and BNF combined is now down to just 264 lines of code (without generic support code like the VM at 369 lines), which I'm quite pleased with. To do something useful (like building an efficient RDF graph) will obviously require quite a bit more.

Turtle progress and parser assembler 2005-04-23