Writing a (Ruby) compiler in Ruby bottom up

This is part of a series I started in March 2008 - you may want to go back and look at older parts if you're new to this series.

Towards Self Hosting

As mentioned last time, this time around, I've first landed a number of changes that will make the parser able to parse the entire compiler (whether or not it parses it correctly or not remains to be seen, and is outside the scope).

You can find these merged in 37d6a0510 and 44914839a72

Secondly, we'll start looking at small parts of the compiler to break into pieces that are:

Amenable to separate testing
Simple enough to try to compile

The goal is to lay the groundwork for refactoring what we now have into smaller units that at the same time both allow us to unit test the module and use it as a test case for the compiler at the same time. Each module we can compile will then both give increased confidence in the generated code and bring us one step closer to the state where the compiler can fully compile itself.

A good set of candidates to start with appears to be the basic Scanner class, followed by the code to parse and print our "s-expression syntax". Will it prove more challenging than expected? We'll see...

Compiling Scanner

The remarkable thing is that "out of the box" ./compile scanner.rb actually yields a binary, though it will not-so-remarkably run into missing "stuff" on trying to run it:

$ /tmp/scanner
attr_reader col
define_method col
Method missing: __send__

Note that this also means that e.g. it quietly accepts that it has no definition for Struct. However that is actually in line with Ruby semantics, and one of the things that makes efficiency such a problem:

ruby scanner.rb runs all the way through (with no output), as it is legal to reference an un-defined variable - it is first if we try to access it we are meant to get an error.

In reality, for a compiler, we'll likely want to at least warn about the most obvious of these at least (I'm sure I can fill several articles on the topic of adding warnings for "legal but stupid" behaviour in semi-sane ways, but lets wait until we can actually compile a reasonably sized program first).

The challenges

In its present form, Scanner actually presents some challenges off the bat just from a cursory look:

It defines an inner class
It relies on Struct which we haven't done anything to implement (and then re-opens the new class, which we don't yet support.
It relies on attr_accessor
It uses blocks
It uses #respond_to?
It references Tokens::Keyword. Looking up Tokens::Keyword will likely fail

This is why self hosting is such a useful step - you're instantly faced with the practicalities of a real program written in the language. And while this program is in our control, making too drastic changes to it will defeat the purpose, so we have an incentive to fix things properly whenever reasonable.

Some solutions

We now have to consider whether to add support for these, whether full or temporary hacks, or change Scanner accordingly.

#respond_to?: Doing this properly involves a Hash of the methods of a class. We'll make a note the method hash is needed, and defer that to a future part. Instead we'll hack up a default #respond_to? which is rater anaemic, and override it for ScannerString, which is the point here anyway.
We'll ignore the inner classes, and move ScannerString and Position out of Scanner.
Struct: We'll change the code here, and define a "proper" Position class and defer Struct until later.
attr_accessor: We've had our painful trip down this road before. I'm not inclined to revisit it anytime soon. We'll do this the hacky way, and special case attr_accessor / attr_reader / attr_writer to custom generate methods for now at least (it may be the best long term solution, though they were the "test case" for define_method previously)
Blocks could / should work in theory. Lets see what is broken and fix it.
Tokens::Keyword: Lets refactor this part. The code calling Tokens::Keyword does not belong in the Scanner. Lets move it to ParserBase, and thus defer this issue.
#is_a?: Scanner#initialize uses '#is_a?' to see if it has been given a File or a stream.
File.file?, File.expand_path and #path are used if we're given a File object, to extract and expand the path name.

But first of all, lets start chopping Scanner into pieces and creating test cases to see what works and doesn't, and then gradually flesh out a more and more complete equivalent.

The first few tests

STDIN and STDOUT

We will need to be able to read from somewhere for the Scanner class to work, so lets first add a basic test of STDIN and STDOUT:


        STDOUT.puts "Hello world from STDOUT"
    }}}   
    
    This instantly fails, as we haven't even defined STDOUT. We'll remedy for now by adding
    
        STDOUT = IO.new

to lib/core/core.rb (in d24d56b)

Next we add a test of STDIN:


        puts STDIN.getc

and modify our step definition to echo "test" into the compiled binaries. This will also fail because STDIN is missing. We make the same change as for STDOUT, but in this case it is not enough:

WARNING: __send__ bypassing vtable not yet implemented.
WARNING:    Called with 0x8766bf8
WARNING:    self = 0x8766b78
WARNING:    (string: 'getc')

This one comes when there is no allocated vtable slot. The intent is for the class structures to use vtable slots like in C++ for example for most methods. The heuristics used to allocate vtable slots currently considers only methods we have seen defined, not methods that have been otherwise mentioned. In this case #getc has not been defined anywhere yet, and so the compiler should have done a lookup in a hash table for this class, and then ended up falling back on a generic #method_missing, but the hash lookup is not yet implemented.

Anyway, lets put in place a #getc in lib/core/io.rb. This is harder than expected, as our s-expression language does not expose a way to get addresses to stack variables easily. So here's a first, horribly inefficient, stab (in 54a0c13)


        # FIXME: This code is specific to a 32 bit little endian
        # arch, and is also horribly inefficient because we don't
        # have an easy way of getting the address of a stack allocated
        # variable.
        %s(do
             (assign tmp (malloc 4))
             (assign (index tmp 0) 0)
             (read 0 tmp 1)
             (assign c (__get_fixnum (index tmp 0)))
             (free tmp)
             )
        c
      end

Of course we can do a lot better by adding, say, an "(addr var)" construct. This is still horribly bad, though - no error checking from read; and it just reads from STDIN regardless of desired filedescriptor for this IO object, and it does no buffering, and other nastiness. Plenty to deal with later...

The constructor

Our first test case from the actual Scanner class will be the only tricky part in #initialize:


    class Scanner
      def initialize(io)
        # set filename if io is an actual file (instead of STDIN)
        # otherwhise, indicate it comes from a stream
        @filename = io.is_a?(File) && File.file?(io) ? File.expand_path(io.path) : "<stream>"
    
        puts @filename
      end
    end
    
    Scanner.new(STDIN)

However this will fail:

Object#is_a? not implemented
Method missing: file?

Looking more closely into this, we'll find a few things: First I thought I'd not yet implemented the ternary conditional, but the actual problem is that it fails to parse the ternary correctly in the face of the logical and ('&&'). You can see this if you run the compiler like this: ruby compiler.rb --parsetree -norequire features/inputs/scanner1.rb. This is likely "just" a priority issue.

But let's untangle this into two tests - one with the ternary if, and one without.


        if io.is_a?(File) && File.file?(io)
          @filename = File.expand_path(io.path)
        else
          @filename = "<stream>"
        end

Changing the ternary to a full if will still prove interesting. Let us quickly hack in a #is_a? just for File. First we need to add one to lib/core/object.rb (882c71b) :


       def is_a?(c)
          false # We're pessimists
       end

Now for the next surprise: We don't have false, do we? We've defined false/true in lib/core/core.rb, but there they are ordinary local variables, only accessible in the main scope. We really ought to give them special treatment as fake global variables referring to global instances of FalseClass and TrueClass, but that opens another can of worms (we can no longer assume non-null means true and null means false, which will change comparisons.

For the time being we sidestep this by defining false inObject` as a method (!) (I added that too in 882c71b) :


      def false
         %s(sexp 0)
      end

This gives us another surprise when trying to compile features/inputs/scanner2.rb (04b6060) ; the one without the ternary if:

We still get "Method missing file?", which means the second part of the supposedly short-circuiting '&&' expression gets executed, but we're not passing a File object in.

This is yet another of our earlier convenient hacks that now needs to be sorted out: && gets turned into and, which in turn gets interpreted as a method call because the compiler does not have a builtin and construct. As a result, it is evaluated after its arguments. Or rather, it would have been, had not the second argument been a sub-expression which itself fails.

As a lesson in why proper testing is essential, let us use this as an excuse to skip File.file? and File.expand_path and File#path this time around: Instead let use make the compiler handle && properly, so that we never get to them. Well, as long as we're only ever trying to compile from STDIN anyway. It's a start.

Fixing the short circuit operator.

The changes to handle it are absolutely minor: We need to add ':and' to the list of keywords in compiler.rb, and define compile_and (in 75d1ecd62d):


    +  # Shortcircuit 'left && right' is equivalent to 'if left; right; end'
    +  def compile_and scope, left, right
    +    compile_if(scope, left, right)
    +  end
    +

But the test case also reveals another bug: When adding the caching of 'self', I failed to account for the special handle of 'self' in the outermost scope (it's treated as a global variable). Maybe treating the global scope any differently is the mistake, and we should just put that too on the stack, but for now the fix is trivial: Just add a check for [:global,:self] in get_arg:


    -      if arg.first == :lvar || arg.first == :arg
    +      if arg.first == :lvar || arg.first == :arg || (arg.first == :global && arg.last == :self)

Doing a "./compile features/inputes/shortcircuit.rb" (or running the rspec tests) now yields a /tmp/shortcircuit that actually short-circuits.

And features/inputs/scanner2.rb now successfully sidesteaps the methods on File we're being too lazy to implement. Though features/inputs/scanner1.rb that uses the ternary conditional still crashes, so there's that.

This is where we'll leave it for now.

Conclusion

This part probably seems all over the place, but it's part of the fun once you get to a stage where it becomes viable to start looking at bootstrapping the compiler in itself.

I'll make the next few parts shorter than some of the 3000+ word monsters I've put out in the past, to also try to focus them a bit more, but next time will focus on getting Scanner#get and Scanner#unget to work, and then work our way through Scanner#expect and Scanner#ws too. If you look at our intended milestone of getting the s-expression parser to compile, it might look like that brings us almost there.

Not so - look forward to complication after complication being revealed as we stumble into class after class, method after method, that we need to implement at least partially before we get there. And possibly some more simplifying compiler changes to sidestep some of them.

But incidentally, each one of them will make the next parts of the compiler easier to get past too.

Writing a (Ruby) compiler in Ruby bottom up - step 35 2014-05-16