Writing a (Ruby) compiler in Ruby bottom up

This is part of a series I started in March 2008 - you may want to go back and look at older parts if you're new to this series.

Send

One of the great "joys" of trying to compile Ruby is of course the level of dynamism. Especially when trying, as I am, to make the compiler as static as possible.

One of the culprits is the ability to send messages to objects that may be composed dynamically (e.g. formatting strings), and that may or may not even exist at compile time.

So far we've avoided this. There are two issues here:

We need to be able to handle methods with names that are not present in the application at compile time. Currently we put everything into vtables, and to know how to allocate slots we need to know the names at compile time.
We need to be able to look up methods by name, and do something with the information.

Matz' Ruby implementation does this with a hash table of methods. And we will too. This does, however, not mean we'll ditch the vtables. The vtables are essential to get fast method calls for methods we call by names statically known, which for most applications will likely be the vast majority.

Even for method calls where the target method was not defined at compile time this works, by pre-allocating a vtable slot.

Instead what we will do is add redundant information for the methods known at compile time, and augment this with the ability to add new methods down the line.

The initial motivation for this is the work towards self-hosting, and the compiler itself, foolishly or not, uses #send to call the #parse_XXX methods in Parser for example.

CS101 - Hash tables

The last time I wrote a hash table implementation from scratch was more than 20 years ago, in my introductory data structures and algorithms course. I read the book on a long train journey and mostly ignored the lessons. Apart from frustrating everyone by insisting on doing a N-Queens exercise in Amiga-E (a wonderful language by the amazing Wouter van Oortmerssen, though unfortunately rather esoteric even back in '94)

Thankfully, making a simple hash table is, well, simple.

We'll do a very basic linear scan hash table. If you've forgotten your CS courses and is too lazy to click the wikipedia link, that means that once we've calculated which slot an entry should ideally go in, if it is occupied, we linearly probe the other slots until we find an empty one.

There are all kinds of upsides and downsides from various methods for scanning/probing, and from using scans instad of alternative structures, but the point as usual is simplicity first. Especially in this case as we won't even know the access patterns until things are complete enough to do proper benchmarks.

A hash function, and comparisons

Firstly, let us implement #hash. The documentation for Object#hash says this:

Generates a Fixnum hash value for this object. This function must have the property that a.eql?(b) implies a.hash == b.hash.

The hash value is used along with eql? by the Hash class to determine if two objects reference the same hash key. Any hash value that exceeds the capacity of a Fixnum will be truncated before being used.

The hash value for an object may not be identical across invocations or implementations of Ruby. If you need a stable identifier across Ruby invocations and implementations you will need to generate one with a custom method.

This is thankfully easy to match, and we'll do it with the DJB hash function used in amongst others CDB. It's simple, and reasonably efficient:


    +  # DJB hash
    +  def hash
    +    h = 5381
    +    each_byte do |c|
    +      h = h * 33 + c
    +    end
    +    h
    +  end

You'll note we use a multiply by 33 instead of the ((h << 5) + h) in the description. The real reason for this is that I haven't implemented << yet. But in our case, given Ruby's pathological amount of method calls, h * 33 is actually even likely to be faster than ((h << 5) + h) until we actually add some fairly advanced optimization (so ca. 2030 at my current pace).

Note that our function is not entirely equivalent to the original djb hash function, as the original assumes truncation to a 32 bit value. It has little practical implication here given that we will be using a modulo of it in the actual code.

We'll also add String#eql?, which in this case is just an alias for #==:


    +  def eql? other
    +    self.== other
    +  end

(In 6b01ee3)

Handle modulus

Since I happen to know we'll want a working modulus operator (because we'll want to take the hash value of a string, and take the modulus of the hash value and the capacity of the Array we'll store the hash table in to find out where to start probing), that's what we'll fix next.

We start out gently in 3f24489 by adding it to the operator table in operators.rb, and in e6bb8c7 we add the mod keyword to the s-expression syntax to give us something to implement it with.

In 88066af we implement the mod keyword:


    --- a/compile_arithmetic.rb
    +++ b/compile_arithmetic.rb
    @@ -46,8 +46,16 @@ class Compiler
             @e.movl(@e.result, dividend)
             @e.sarl(31, dividend)
             @e.idivl(divisor)
    +
    +        yield if block_given?
           end
         end
         Value.new([:subexpr])
       end
    +
    +  def compile_mod(scope, left, right)
    +    compile_div(scope,left,right) do
    +      @e.movl(:edx, @e.result)
    +    end
    +  end
     end

Piece of cake as div already does the calculation, we just need to actually save the modulus that's already calculated.

Then starts the pain - or "% is the devil"

% is one of the most heinous abominations of Ruby syntax. I'm not talking about the modulus operator. I'm talking about the other %. The evil twin.

You have met it. It's the % that starts quoted strings og all kinds. We've "stolen" one of the codes for the s-expression syntax.

Only % is really horrendously ambiguous. For example, consider what this expression should parse as:

% x

In a sane, just world, this would be a syntax error.

Nah. Too obvious, isn't it? Can't be anything but "x". Uh? What?!?

As it happens, in contexts where it can't validly be the infix modulus opeator, % starts a quoted string even when the following character is a space.

For characters outside of a specific set of exceptions with special meaning, % will treat the character immediately following as the "quote" character that will start and terminate the given string. So above, we're basically using space as the equivalent of a quote.

(If you have ever seen that used other than as a way of terrorising language implementors or other unsuspecting ~~developers~~ victims, I want to know)

Of course this broke the parser, as we've consistently gone for simplicity first...

First of all, let's get some test cases in place (in 8473eaecb):


    
        @mod
            Scenario Outline: Simple expressions
                    Given the expression <expr>
                    When I parse it with the full parser
                    Then the parse tree should become <tree>
    
            Examples:
          | expr       | tree                 | notes  |
          | "% x "     | [:do,"x"]            |        |
          | "a + % x " | [:do,[:+,:a,"x"]]    |        |
          | "1 % 2 "   | [:do,[:%,1,2]]       |        |
          | "1 % 2"    | [:do,[:%,1,2]]       |        |

Then we fix it. As it happens the fix is simple. As horrendous as this syntax looks, at least the cases we're concerned with now are just a matter of distinguishing between a prefix and infix operator, and the shunting yard parser has a mechanism for that. First we change the operator table (in c276794):


    --- a/operators.rb
    +++ b/operators.rb
    @@ -97,7 +97,10 @@ Operators = {
         :prefix => Oper.new( 20, :-,      :prefix)
    },
       "!"         => Oper.new( 10, :"!",      :prefix),
    -  "%"         => Oper.new( 20, :"%",      :infix),
    +  "%"         => {
    +    :infix_or_postfix => Oper.new( 20, :"%",      :infix),
    +    :prefix => Oper.new( 20, :quoted_exp, :prefix)
    +  },
       "/"         => Oper.new( 20, :/,      :infix),
    "*"         => {
         :infix_or_postfix => Oper.new( 20, :"*",   :infix),

This basically makes the parser accept both versions in their respective contexts.

Then we need to handle it. As it happens we can use the opportunity for some minor cleanup. We already added code to special case the operator case of '%', and that's actually what broke the modulo operator case. So now we rip that out, and add a helper #get_quoted_exp to let the parser choose whether to call back in and parse a quoted expression (and make use of that helper one other place as well) (in 3815fb2):


    --- a/tokens.rb
    +++ b/tokens.rb
    @@ -143,10 +143,15 @@ module Tokens
           @s.unget(s)
         end
    
    +    def get_quoted_exp(unget = false)
    +      @s.unget("%") if unget
    +      @s.expect(Quoted) { @parser.parse_defexp } #, nil]
    +    end
    +
         def get_raw
           case @s.peek
           when ?",?'
    -        return [@s.expect(Quoted) { @parser.parse_defexp }, nil]
    +        return [get_quoted_exp, nil]
           when DIGITS
             return [@s.expect(Number), nil]
           when ALPHA, ?@, ?$, ?:, ?_
    @@ -182,11 +187,6 @@ module Tokens
           else
             # Special cases - two/three character operators, and character constants
    
    -        if @s.peek == ?%
    -          r = @s.expect(Quoted) { @parser.parse_defexp }
    -          return [r,nil] if r
    -        end
    -

Note that in the operator case, the rest of that code have already 'eaten' the '%' so we need an option to unget it.

We also expose the #get_quoted_exp helper in tokenizeradapter.rb.

And finally in shunting.rb we trigger on quoted_exp and call back into the tokenizer via the adapter (in 3815fb2):


    +    def parse_quoted_exp
    +      @tokenizer.get_quoted_exp
    +    end
    +
         def shunt(src, ostack = [], inhibit = [])
           possible_func = false     # was the last token a possible function name?
           opstate = :prefix         # IF we get a single arity operator right now, it is a prefix operator
    @@ -90,6 +94,8 @@ module OpPrec
                 else
                   raise "Block not allowed here"
                 end
    +          elsif op.sym == :quoted_exp
    +            @out.value(parse_quoted_exp)
               else
                 if op.type == :rp
                   @out.value(nil) if lastlp

Finally we can implement the '%' operator in Fixnum (in 9f6f151):


    --- a/lib/core/fixnum.rb
    +++ b/lib/core/fixnum.rb
    @@ -5,6 +5,16 @@ class Fixnum < Integer
         %s(assign @value 0)
       end
    
    +  def % other
    +#    %s(printf "%i\n" (callm other __get_raw))
    +    %s(assign r (callm other __get_raw))
    +    %s(assign m (mod @value r))
    +
    +    %s(if (eq (gt m 0) (lt r 0))
    +         (assign m (add m r)))
    +    %s(__get_fixnum m)
    +  end
    +

Note the last if expression, which is used because the x86 instruction set and Ruby disagrees on how to handle modulo operations with negative numbers, and I opted to stick with code similar to what gcc outputs for C rather than encode the Ruby behaviour in the s-exp syntax. I may yet change my opinion on that - we'll see, but for now this works.

Minor Array fixes

In addition to the above, in 928fe97 I add support for Array.new(capacity) and Array#capacity, as well as make Array#[]= work for the basic cases, as this will be needed for the Hash implementation below.

Creating the class hash table

In 913a3de you will find an initial implementation of Hash. Notably, this implementation is pure Ruby, and (basic, simple) programs will not even necessarily break if you load this implementation in MRI and proceed to use it. It also maintains insertion order, like MRI 1.9+, which aids in testing vs. MRI.

It does, however, lack most methods, but it should be sufficient for us for the next steps. It is worth noting, though, that it is certainly still inefficient in eg. the interaction with Array, and there's plenty of scope for improvement in the future.

I'll leave figuring the hash implementation out as an exercise for the reader (or ask in the comments), as there's nothing specific to the compiler in that one and it should be quite simple: Basically I maintan an array of slots consisting of pointers - one to the key, one to the value, and a linked list to maintain insertion order. Then on setting and reading the hash table, we calculate the hash of the key, look up the first applicable slot, then we loop over the hash table until we've found a matching key, possibly wrapping around, or we find an empty slot.

If the table is too small, we re-allocate the array and re-insert all the key/value pairs.

We could do this more efficiently for special cases like the vtable where we don't need the insertion order, for example, but as usual, efficiency for later.

Creating a hash table of String => Symbol

But in the interest of (some) efficiency as in this case it also brings us closer to feature-completeness, I will first make one further improvement to Symbol handling.

Since we now have a Hash table, we can store a hash mapping Symbol names to Symbol instances, and thus gain the two main advantages Symbol is meant to have: Ability to treat object identity as equality (in other words: compare pointers rather than string values in our case), and secondly reusing the same object instance each time the symbol is mentioned.

We'll use this to make the cost of the class method tables less excessive.

Firstly, to make Symbol work with these changes, it was necessary to move it later in the bootstrap of the runtime (in fc674f3). This is one of the things I find most interesting with working on this compiler: Slowly teasing out a minimal core of Ruby and figuring out how to bootstrap just the lower level facilities and bootstrap. That's going to be an interesting subject in its own right at some point.

Secondly we implement Object#object_id as simply returning %s(__get_fixnum self). So (and this should not be relied on), #object_id in our case is actually the fixnum version of the pointer to the object.

We strictly don't need this, but it's used by my test case, and also lets me isolate the s-expression parts further, implementing Object identity comparisons (note: these actually belongs in Comparable, but we don't do include and friends yet). (In 38ca5270e3)

These leave us with a very small manageable set of changes to Symbol. Small enough to just include the whole rudimentary Symbol class here (changes can be seen in 9abb565, and followup in 077dcf3):


    class Symbol
       @symbols = {}

The above is our symbol-name to Symbol hash table.

Cleaned it up to use proper String objects for the name (though #to_s should probably use #dup, which I don't think we've implemented yet)


      # FIXME: Should be private, but we don't support that yet
      def initialize(name)
        @name = name
      end
    
      def to_s
        @name
      end
    
      def hash
        to_s.hash
      end
    
      # FIXME
      # The compiler should turn ":foo" into Symbol.__get_symbol("foo").
      # Alternatively, the compiler can do this _once_ at the start for
      # any symbol encountered in the source text, and store the result.
      def self.__get_symbol(name)
        sym = @symbols[name]
        if sym.nil? ## FIXME: Doing !sym instead fails w/method missing
          sym = Symbol.new(name)
          @symbols[name] = sym
        end
        sym
      end
    end
    
    %s(defun __get_symbol (name) (callm Symbol __get_symbol ((__get_string name))))

Here is the fun: Look up the symbol; if not present in the hash table, create a new one, and return it.

One notable thing: MRI recently added garbage collection of symbols. The above actively works counter to garbage collection of symbols - once we get to add a GC we'll want to deal with that by letting the GC handle certain structures, such as this hash, as "weak" references that can be broken. It's an extra complexity, but far away.

Bootstrapping the values

Eventually we are going to use two separate types of hash tables: Symbol => offset in vtable for statically allocated methods, and Symbol => address of method for runtime defined methods. This will keep the class specific hash tables small at the cost of one extra hash table lookup in #send, but #send will always be slow compared to the direct vtable lookup anyway.

This time we will only actually implement the former, and only handle cases of "#send" called with method names we've statically seen. This actually covers a substantial proportion of the use of #send. The big exception is cases where a method is both dynamically defined and dynamically used by string composition or by requiring files at load time (which we don't support yet). Most importantly at this stage it handles the specific case of bootstrapping the compiler itself.

We'll declare it out of bounds for the compiler itself to use #send before a suitable point in the bootstrap process, where we call a compiler generated function to bootstrap this hash table.

Calling #send before that will trigger a runtime error.

In 5a269ec I add a new file lib/core/class_ext.rb that implements the send "send" logic, and the code to bootstrap these values. This goes in a separate file for the simple reason that it needs to be loaded after Symbol and various other things have gotten bootstrapped, but the basics of Class are needed much earlier.

First of all, this is how the send happens:


      # This is called by __send__
      def __send_for_obj__ obj,sym,*args
        voff = Class.method_to_voff[sym]
        if !voff
          %s(printf "WARNING: __send__ bypassing vtable (name not statically known at compile time) not yet implemented.\n")
          %s(if sym (printf "WARNING:    Method: '%s'\n" (callm (callm sym to_s) __get_raw)))
          %s(printf "WARNING:    symbol address = %p\n" sym)
          %s(printf "WARNING:    object = %p\n" obj)
          %s(printf "WARNING:    class '%s'\n" (callm (callm self name) __get_raw))
          nil
        else
          # We can't inline this in the call, as our updated callm
          # doesn't allow method/function calls in the method slot
          # for simplicity, for now anyway.
          %s(assign raw (callm voff __get_raw))
          %s(callm obj (index self raw) ((splat args)))
        end
      end

There's not that much new or different here. Mainly we look up the method name in the new hash table (which should not be public, but we'll worry about that later). Secondly, if the vtable offset is known - which will be the case for all methods we've seen statically mentioned - we attempt to convert the stored Fixnum to a raw integer, retrieve the address from the vtable, and do a callm with the address instead of a method name.

This latter part is new, and we'll have a look at how to handle that shortly, but first the last part of lib/core/class_ext.rb:


    #
    # Populate the Class.method_to_voff hash based on the __vtable_names
    # table that gets generated by the compiler.
    #
    %s(let (i max)
       (assign i 0)
       (assign max __vtable_size)
       (assign h (callm Class method_to_voff))
       (while (lt i max)
          (do
             (if (ne (index __vtable_names i) 0)
                (callm h []= ((__get_symbol (index __vtable_names i)) (__get_fixnum i)))
                )
              (assign i (add i 1))
          )
       )
    )

This is fairly straight forward: We iterate over a table name __vtable_names that the compiler will henceforth output, and for __vtable_size entries we obtain a Symbol object, and then add a mapping from the ymbol to the according offset to the Class.method_to_voff hash table.

This table was actually added in e6bb8c75 - I think that method in Compiler should be reasonable self-explanatory - basically outputs an array of pointers to constant strings.

Handling callm with computed addresses

Previously %s(callm ...) have only supported names. Now it needs to supports %s(callm object <some expression resulting in method address> ...) too.

We handle this as follows. Firstly, we (in 46bb9d0) we check if the method is provided by name (as a Symbol), and only use the old logic if it is:


    +      if method.is_a?(Symbol)
    +        off = @vtableoffsets.get_offset(method)
    +        if !off

Then further down, we conditionally compile the actual call differently:


    -        @e.callm(m)
    +        if off
    +          @e.callm(m)
    +        else
    +          # NOTE: The expression in "method" can not
    +          # include a function call, as it'll clobber
    +          # %ebx
    +          @e.call(compile_eval_arg(scope,method))
    +        end

I'm not quite pleased with the complexity of compile_call.rb, but it'll do for now.

Parting words

And that's in. Doing #send now works. And we have a somewhat working Hash.

Not bad.

And with this I will start doing what I've mentioned for a while now and switch to writing shorter, more focused entries on specific features rather than these huge parts. And feature branches instead of branches focused on a specific part....

Hopefully it will get things moving a bit quicker.

(This also signals that I'll be more open to outside contributions and suggestions, though I recognise this probably isn't the easiest codebase to jump into, despite the copious amount of writing I've done about it's internals)

Writing a (Ruby) compiler in Ruby bottom up - step 45 2015-07-26