Feature: Shunting Yard In order to parse expressions, the compiler uses a parser component that uses the shunting yard algorithm to parse expressions based on a table. Scenario Outline: Basic expressions Given the expression expr When I parse it with the shunting yard parser Then the parse tree should become tree Examples: | expr | tree | | "__FILE__" | :__FILE__ | | "1 + 2" | [:add,1,2] | | "1 - 2" | [:sub,1,2] | | "1 + 2 * 3" | [:add,1,[:mul,2,3]] | | "1 * 2 + 3" | [:add,[:mul,1,2],3] | | "(1+2)*3" | [:mul,[:add,1,2],3] | | "1 , 2" | [:comma,1,2] | | "a << b" | [:shiftleft,:a,:b] | | "1 .. 2" | [:range,1,2] | | "a = 1 or foo + bar" | [:or,[:assign,:a,1],[:add,:foo,:bar]]| | "foo and !bar" | [:and,:foo,[:not,:bar]] | | "return 1" | [:return,1] | | "return" | [:return] | | "5" | 5 | | "?A" | 65 | | "foo +\nbar" | [:add,:foo,:bar] | | ":sym" | :":sym" | | ":[]" | :":[]" |As of writing, the parser components have 106 scenarios (each entry in the example tables counts as one scenario) including a few failing ones. Whenever we find anything broken, a new test case goes in. 106 scenarios is nowhere near complete, but it helps tremendously with debugging, particularly with the operator precedence parser, as it's been fairly tedious to adjust and adapt it to handle the peculiarities of Ruby. It also serves as documentation of sorts of what the parser is expected to deliver
def self.parser(scanner, parser) ShuntingYard.new(TreeOutput.new,Tokens::Tokenizer.new(scanner), parser) endWe start by instatiating the parser and passing the most frequently used components to it. By passing an alternative to TreeOutput you an turn the output into reverse polish notation or anything you like. But we want a parse tree. Tokens::Tokenizer is used to retrieve a stream of tokens instead of working on individual characters. Last but not least we're passing it a reference to the recursive descent parser. Yuck. We do this because of Ruby's block syntax, which because of the way blocks can look like literal hashes and/or can be chained and used as part of an expression means we need to be able to call bak out of the shunting yard parser.
def parse(inhibit=[]) out = @out.dup out.reset tmp = self.class.new(out, @tokenizer,@parser) res = tmp.shunt(@tokenizer,[],inhibit) res ? res.result : nil endA couple of oddities here. I've introduced an argument to allow specifying a set of tokens to inhibit and terminate the expression. This is needed because of peculiarities in how Ruby handle commas. Some places it's ok as it separates function arguments or array elements, but you also need to be able to put expressions IN argument lists as default assignments, and there commas are not allowed without parantheses, as they're needed to separate arguments, and the argument list itself is parsed by the recursive descent parser. It would be possible to change this, but I'm not sure I want to try to make the argument list be parsed by the same component until I'm more sure of the consequences. Besides the current arrangement works.
def shunt(src, ostack = [], inhibit = []) possible_func = false # was the last token a possible function name? opstate = :prefix # IF we get a single arity operator right now, it is a prefix operator # "opstate" is used to handle things like pre-increment and post-increment that # share the same token. lp_on_entry = ostack.first && ostack.first.type == :lp opcall = Operators["#call#"] opcallm = Operators["."] lastlp = true
The commented local vars are hopefully reasonably understandable. "lp_on_entry" is set to true if the method is passed an :lp operator in - that means '(','[', '{' etc.. We'll see how that's used later. opcall and opcallm are just convenience shortcuts, though the hardcoded references to Operators[] are ugly (we'll meet more of them, and they'll go as soon as I find a clean way of decoupling the logic in this method). "lastlp" is set to indicate whether or not the last token was an :lp type operator.
src.each do |token,op| if inhibit.include?(token) src.unget(token) break endOk, so this is the start of the main loop. We get the token as a string in in "token", and if it's an operator we get the operator object in "op". We check if the token is member of a set of "inhibited" tokens that we don't allow to be part of an expression. This is used to handle cases where the recursive descent parser wants an expression that stops on encountering something that is normally a legal part of the operator. We'll see it used when we look at the recursive descent parser.
if op op = op[opstate] if op.is_a?(Hash)We then start handling operator tokens - most of the loop is split between handling operators and handling non-operator tokens, with very little shared logic. In some cases there will be two operators that share the same token. Currently we handle :infix_or_postfix vs. :prefix operators, and the only case it's currently used for is "*" as multiplication operator vs. as the "splat" operator (which expands arrays).
# This makes me feel dirty, but it reflects the grammar: # - Inside a literal hash, or function call arguments "," outside of any type of parentheses binds looser than a function call, # while outside of it, it binds tighter... Yay for context sensitive precedence rules. # This whole module needs a cleanup op = Operators["#,#"] if op == Operators[","] and lp_on_entryThe comment above speaks for itself, no? And we see a use for lp_on_entry to help decide the precedence rule to use for the comma, by switching objects around (Note that this also shows an anti-pattern that needs to be fixed: When you have hashes, try to avoid string keys. Strings are expensive objects where symbols get mapped to an integer - for things like this using symbols as keys tends to be more efficient).
if op.sym == :hash_or_block || op.sym == :block if possible_func || ostack.last == opcall || ostack.last == opcallm @out.value([]) if ostack.last != opcall @out.value(parse_block(token)) @out.oper(Operators["#flatten#"]) ostack << opcall if ostack.last != opcall elsif op.sym == :hash_or_block op = Operators["#hash#"] shunt(src, [op]) opstate = :infix_or_postfix else raise "Block not allowed here" endThe whole block above gets executed if the current operator is a '{' (:hash_or_block) or :block ("do"), and it's one of the hairy bits... If we've seen what is possibly the start of a block (possibly, because '{' could also start a hash, hence the symbol), we first check if we're possibly looking at a function call (a pre-requisite for a block) based on the previous token, OR if the last operator to have been pushed on the operator stack is the function or method call operators (note: We currently still differentiate between function and method calls, though that is not a distinction Ruby makes; it is helpful for some aspects of the parser, and it is helpful for the low level s-expression syntax, but this distinction will normally become hidden from the "end-user").
else if op.type == :rp @out.value(nil) if lastlp @out.value(nil) if src.lasttoken and src.lasttoken[1] == Operators[","] src.unget(token) if !lp_on_entry endIf we are not dealing with a potential block, we move on. First, above, we check for a :rp (right parenthesis, bracket or brace) operator. We "fake" a value if we've just seen an empty pair of parentheses, so we don't have to deal with operand-less operators elsewhere. We also fake value if the last token was a comma operator. This is to handle the convenience case in Ruby where arrays or hashes can have a "dangling" comma at the end like so: [1,2,3,] (useful for symmetry and machine generating Ruby source.
reduce(ostack, op)We call reduce in order to fore output of any tighter binding operators. In the case of, say, 1 * 2 + 3, (* 1 2) will get output when encountering '+'.
if op.type == :lp shunt(src, [op]) opstate = :infix_or_postfix # Handling function calls and a[1] vs [1] ostack << (op.sym == :array ? Operators["#index#"] : opcall) if possible_funcIf it's an lp type operator, we do as we did for Hash ('{'), except we also want to selectively output "#index#" or the "#call#" operators if we're dealing with a function call. This occurs when we have an expression like "foo(1)", in which case "possible_func" will be true after "foo" has been encountered, and we then parse "(1)" and pass the result onto the output handler, and then output the "#call#" operator to tie "foo" and it's arguments together. As the slightly misleading comment says, "#index#" is used if we see '[...]' after something that might indicate a function call, as we need to differentiate a[1] (a call to the method "[]" on the object reference held in "a") vs [1] (constructing an Array object).
elsif op.type == :rp breakIf we dealt with an :rp we want to exit from the loop.
else opstate = :prefix ostack << op end end... if not we just push the operator on the operator stack.
else if possible_func reduce(ostack) ostack << opcall end @out.value(token) opstate = :infix_or_postfix # After a non-operator value, any single arity operator would be either postfix, # so when seeing the next operator we will assume it is either infix or postfix. end possible_func = op ? op.type == :lp : !token.is_a?(Numeric) lastlp = false src.ws if lp_on_entry endOk, time to handle non-operator values. Not much to say about this - it's hopefully reasonably understandable. The last line about will skip whitespace including line feeds if we're inside a parenthesized sub-expression, while normally (inside the tokenizer) we only skip whitespace excluding linefeeds. This is because the general rule in Ruby is to allow linefeeds anywhere where it doesn't create an ambiguity. I'm sure there are more special cases we'll need to handle, but for now this approximates the rule closely enough.
if opstate == :prefix && ostack.size && ostack.last && ostack.last.type == :prefix # This is an error unless the top of the @ostack has minarity == 0, # which means it's ok for it to be provided with no argument if ostack.last.minarity == 0 @out.value(nil) else raise "Missing value for prefix operator #{ostack[-1].sym.to_s}" end end reduce(ostack) return @out if ostack.empty? raise "Syntax error. #{ostack.inspect}" endThis is the last part we'll look at (for the "reduce" method, see the earlier post), and it's outside the look. It's down to handling errors and optional operands, and then reducing the operator stack completely before returning the output handler (which should at this point hopefully contain a complete expression).
def parse_block(start = nil) pos = position return nil if start == nil and !(start = expect("{") || expect("do")) close = (start.to_s == "{") ? "}" : "end" ws args = [] if expect("|") ws begin ws if name = parse_name args << name ws end end while name and expect(",") ws expect("|") end exps = parse_block_exps ws expect(close) or expected("'#{close.to_s}' for '#{start.to_s}'-block") return E[pos,:block] if args.size == 0 and !exps[1] || exps[1].size == 0 E[pos,:block, args, exps[1]] endThis is what the shunting yard parser calls back into. Most of this should be fairly understandable assuming you remember that "expect" tries to match a string, and "ws" skips past whitespace.
# instance variables. # if it starts with a single "@", it's a instance variable. if a.to_s[0] == ?@ or @instance_vars.include?(a) offset = @instance_vars.index(a) add_ivar(a) if !offset offset = @instance_vars.index(a) return [:ivar, offset] endAt the moment the above contains a bug, as the instance variables will actually get added too late for the "new" method to correctly allocate the object of the right size. So in the next batch of changes will be one to "visit" the nodes of a class definition to pre-allocate the instance variable offsets.
if atype == :ivar ret = compile_eval_arg(scope,:self) @e.save_to_instance_var(source, ret, aparam) return [:subexpr] endEmitter#save_to_instance_var just stores the value to the appropriate offset into the object.
def __new_class_object(size) ob = malloc(size) %s(assign (index ob 0) Class) ob end class Class def new # @instance_size is generated by the compiler. YES, it is meant to be # an instance var, not a class var ob = malloc(@instance_size*4) %s(assign (index ob 0) self) ob end endCalls to __new_class_object is called by the bootstrap code generated by the compiler (in compile_class). It just uses the C-library's malloc() for now, and assigns a pointer to the class Class in the first slot.