I have very strong opinions when it comes to parser generators. Most parsers I've written have ended up being hand written or have used some half-assed parser generators I've written myself, because I've yet to find a parser generator I like. I've found many I like aspects of, but they invariably seem to fail on one or more of the following points. I'd love to get feedback on suggestions for parser generators I ought to look at.
Here are my "rules":
- Low coupling. In this case that translates to avoiding a pet peeve of mine: intermingling target-language code with the grammar. A good grammar in a clean syntax is documentation as well as code, and messing it up with interspersed code is annoying. But there's a more important reason: any reasonably sized grammar gets complex to understand pretty quickly, so I don't want to have to rewrite it if I want to reuse the grammar in another target language. This implies not only expressing the grammar without mixing in code in another language, but also avoiding dependence on external code to change the workings of the grammar. For some languages, such as C++ and Ruby, that means the typical functionality of a parser generator needs to be extended (C++, for example, needs the parser to know whether a given string is a type name or not in some cases, if I remember correctly, and Ruby has that awful - from a parsing perspective at least - syntax for multiple "here documents").
- Clean syntax. I'm picky about syntax in all languages. I'm extra picky about it in parser specifications. I like BNF-style syntax, but even there I have very specific ideas about what looks good. For example, I can't stand the ABNF syntax specified in RFC 2234. I find it outright painful to use "/" instead of "|" for "or". Yes, it's probably because I've gotten used to the pipe character rather than any kind of objective quality, but it matters nonetheless. More important than looks, though, is that the syntax is concise.
I can't stand parser generators that don't make the input really terse. For most simple languages you can write a terse, extremely readable recursive descent parser by implementing a tiny reusable set of parser combinators: functions or objects that each handle one small aspect of parsing, such as "or", and defer the rest to other parsers, combined via function or object references. If the input to the parser generator isn't terse, much of the advantage (certainly not all) goes out the window.
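To illustrate how small such a combinator set can be, here's a minimal sketch in Ruby (hypothetical helper names, not from any particular library): each combinator is a lambda taking an input string and a position, returning the new position on success or nil on failure.

```ruby
def str(s)            # match a literal string
  ->(input, pos) { input[pos, s.length] == s ? pos + s.length : nil }
end

def alt(*parsers)     # ordered choice: the "or" combinator
  ->(input, pos) do
    parsers.each do |p|
      r = p.call(input, pos)
      return r if r
    end
    nil
  end
end

def seq(*parsers)     # sequence: each parser consumes in turn
  ->(input, pos) do
    parsers.each do |p|
      pos = p.call(input, pos)
      return nil unless pos
    end
    pos
  end
end

def many(parser)      # zero or more repetitions, like "*"
  ->(input, pos) do
    while (r = parser.call(input, pos))
      pos = r
    end
    pos
  end
end

# "ab", followed by zero or more "c" or "d"
rule = seq(str("ab"), many(alt(str("c"), str("d"))))
rule.call("abcdc", 0)   # => 5 (consumed the whole string)
rule.call("ba", 0)      # => nil (no match)
```

Four tiny functions already read almost like the BNF they implement, which is exactly why a parser generator whose input is less terse than this loses much of its appeal.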
- Easy to retarget. I really, really can't stand parser generators that can't easily be modified to target a new language. Most of the good ones, like ANTLR and Ragel, are fairly nice in that respect (and Ragel might, as far as I can tell, support not mixing in target-language code too? I haven't looked enough at Ragel yet).
- Supports multiple purposes. One of my BIG issues with many parser generators is that they tend to support only generating push parsers, or only pull parsers, or only feeding you events or a full parse tree, etc. Many of them make it hard to, for example, generate syntax diagrams or tree walkers or documentation, or they do one or more of those but in one way only, and if your model doesn't fit theirs perfectly, you have to dig into the core code. It's by no means true for all of them, but it's something I've run across time after time. This is a typical case where high cohesion and low coupling mean something. Making the AST of the parser generator itself trivially accessible to plugins or filters so that other tools can benefit is a huge plus.
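The push/pull distinction in particular doesn't have to be baked in. Here's a sketch (hypothetical API, not from an existing tool) of how one event-producing core can serve both styles in Ruby: the push interface yields events to a block, and the pull interface is just a thin Enumerator wrapper over the same core.

```ruby
class TokenScanner
  def initialize(input)
    @input = input
  end

  # Push style: invoke the caller's block for every event.
  def each_event
    @input.scan(/\d+|[a-zA-Z]+/) do |tok|
      kind = tok =~ /\d/ ? :number : :word
      yield kind, tok
    end
  end

  # Pull style: an external Enumerator over the same event stream.
  def events
    enum_for(:each_event)
  end
end

# Push: the scanner drives the consumer.
TokenScanner.new("abc 12 de").each_event { |kind, tok| puts "#{kind}: #{tok}" }

# Pull: the consumer drives the scanner, one event at a time.
pull = TokenScanner.new("abc 12 de").events
pull.next  # => [:word, "abc"]
pull.next  # => [:number, "12"]
```

A generator that emits its parsers in this shape gets both modes for free; the ones that hard-code one or the other are the ones I keep running into.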
- Good grammars are simple grammars. Ruby's grammar, for example, scares the shit out of me in terms of all the horrendous edge cases. MOST Ruby usage is clean, and MOST of the time the parser does what you think it should, but it's a monstrous grammar when you actually start digging into it. It doesn't help that the parse.y file of Matz's Ruby interpreter is far more verbose than it needs to be. Languages like Oberon have grammars that can fit on an A4 page. With large fonts. And comments. While a language like Ruby can afford to have a somewhat larger grammar, there's no excuse for making it as big as it is. But nor is there an excuse for making matters worse with a verbose parser generator, or one with a complex grammar. If the grammar of the parser generator takes up more than a page, I run away screaming. I want something simple and formal.
- Generate simple code, or at least offer it as an option. I hate trying to debug grammars without being able to sensibly instrument (automatically or manually) the generated code and be able to read and understand it. Table-driven generators like most Yacc/Bison-style generators are particularly nasty, but most generators I've seen fail here. I like generators that at least give the option of creating simple, clean, recursive descent parsers, as recursive descent tends to be very simple for humans to understand compared to many of the alternatives. They are also simple to test and instrument, particularly if you can call directly into specific rules.
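The style of output argued for here would look something like this sketch in Ruby (illustrative names, not the output of a real generator): one plain method per grammar rule, so any rule can be called and unit-tested directly.

```ruby
class ExprParser
  def initialize(input)
    @input = input
    @pos = 0
  end

  # expr := number ("+" number)*
  def expr
    left = number or return nil
    while accept("+")
      right = number or return nil
      left = [:add, left, right]
    end
    left
  end

  # number := [0-9]+
  def number
    start = @pos
    @pos += 1 while @pos < @input.length && @input[@pos] =~ /[0-9]/
    return nil if @pos == start
    [:num, @input[start...@pos].to_i]
  end

  def accept(s)
    if @input[@pos, s.length] == s
      @pos += s.length
      true
    end
  end
end

# Each rule is an ordinary method, so testing a single rule is trivial:
ExprParser.new("42").number  # => [:num, 42]
ExprParser.new("1+2").expr   # => [:add, [:num, 1], [:num, 2]]
```

Code like this can be stepped through in a debugger and instrumented with a one-line change per rule, which is exactly what table-driven output makes painful.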
A couple of years ago I wrote an experimental "parser assembler" and built a BNF to "assembler" generator and a VM to run it on. It was moderately successful as an experiment in that I was able to build a number of parsers from it very easily. Most BNF could be translated almost mechanically, and work, and at the same time the BNF parser generated a reasonably clean AST that would be simple to manipulate. The parser "assembler" made it highly portable, and made the output readable (with a little bit of effort, admittedly) and made the parsers compact, and I was toying with making another backend to generate x86 assembler directly. Because of the execution model, the parser assembly could be used both as pull or push parsers as desired, and that was what I liked the most about it.
But in retrospect I'm still not happy. The syntax was too messy, the way it interacted with the "outside" wasn't optimal (you had to keep track of numbered events), and it didn't do anything to generate ASTs or make writing tree walkers simpler, etc. So while it avoided a couple of typical pitfalls, and was tiny and didn't intermingle target-language code with the parser, I still want something better.
I started on what I thought would be a cleaner grammar, and I believe I got to something that was reasonably nice. I came across it in my notes again the other day when I was cleaning up my latest compiler post. Consider it a strawman - it's not complete and probably has stupid bugs, since the generator hasn't actually been implemented, so I haven't been able to test it on itself. This grammar defines the hypothetical parser generator syntax in (a subset of) itself:
grammar ParseGen
rules {
start< :grammar> { "grammar" @name @rules }
rules { "rules" "{" $rule* "}" }
rule { @name ("<" @var ">")? "{" @expr* "}"? }
expr { $or_expr }
or_expr { post_expr<@left> ("|" or_expr<@right>)? }
post_expr { $cut | $glue | $simple }
simple { @primary ("<" @var ">")? ("?" | "*" | "+")<@modifier>? }
primary { paren<@expr> | @keywords | @name | @string | @number | @set }
paren { "(" $rule+ ")" }
keywords { "ANY"<:any> | "EOF"<:eof> }
string { (('"' _ ([\"]*)<$str> _ '"') | ("'" _ ([\']*)<$str> _ "'")) }
var { ('@'<:var>|'$'<:return>|':'<:symbol>) _ <@vartype> @name }
number { @base10 | @base16 }
name { ([a-zA-Z][a-zA-Z0-9_]*) }
set { '[' (''? ("\\]"|[\]])*) ']' }
cut { "!"< :cut > }
glue { "_"< :glue > }
base10 { [0-9]+ }
base16 { 'x'_[0-9a-fA-F]+ }
WHITESPACE { 9 | 13 | 10 | ('#' [~\n]* 10) }
}
The observant might notice that the syntax appears to steal a lot of elements from Ruby. That's not a coincidence. I like the way it looks, and it also allows me to use Ruby syntax highlighting to make it look good. I've tried not abusing it too much in terms of changing meaning around. Yes, I'm too lazy to figure out how to write my own syntax highlighter in elisp, so sue me.
Here are some notes on what the syntax is intended to achieve:
- Mostly it is a relatively typical EBNF "dialect" and should be fairly recognizable if you've looked at various grammars in EBNF dialects.
- One of the things I hate about many grammars is the way they handle whitespace. I made the decision to make a "WHITESPACE" rule that by default is allowed everywhere unless expressly blocked with the "glue" operator "_". You can then trivially enough get "normal" behavior by setting the WHITESPACE rule to {} (nothing) and creating your own whitespace rule and use it where you want whitespace instead of using the glue operator where you don't want whitespace.
- I want to be able to generate ASTs and event-driven parsers from the same thing. By default, I am assuming that the name of a rule is going to become either the basis for an event (or set of events) or a node in the tree. The exception is when a variable starting with "$" is present.
- A variable starting with "$" indicates that the value returned will become the event or AST node, meaning that the current rule will not appear anywhere; it is there for convenience only.
- Rule names or applications can be suffixed by a variable name enclosed in angle brackets. For a rule name, that name designates the name of the event or AST node generated; for a rule application, it names the variable to be set in the event or node.
- If a rule application is prefixed with '$' or '@', then the rule name itself is used as the variable name. So "@var" is shorthand for applying the rule "var" and storing the result under the variable name "var".
- If a rule application or variable name is prefixed with ":" then that indicates that the name itself will be returned as a symbol (or equivalent in the target language) if the rule matches, or nil/null/equivalent when it doesn't.
- The "cut" operator, "!", designate a point beyond which backtracking causes an error. This isn't really needed in a language with only one symbol lookahead, but in this case there's no distinction between a separate lexer and the parser. It also doesn't hurt to be more flexible as it often simplifies the parser (at the cost of making it easy to write badly performing ones if you rely on a lot of backtracking). I find this a convenient way of adding error handling, by setting an error code at the cut point (though the current grammar doesn't support that). The "cut" operator is borrowed from Prolog.
The goal is to get a very simple, compact syntax while still retaining enough information to generate full syntax trees. I do aim to add some way of calling out to the target language to carry out some operations too, but I consider that a last resort. I do intend to look at ways of handling at least things like Ruby's "here documents" without resorting to callbacks.
I don't think I'll have time to write actual code for this for a long time, but I'd love comments on the ideas/thoughts above, and I'll definitely keep thinking about it.