This is part of a series I started in March 2008 - you may want to go back and look at older parts if you're new to this series.
As mentioned last time, this time around, I've first landed a number of changes that will make the parser able to parse the entire compiler (whether or not it parses it correctly or not remains to be seen, and is outside the scope).
Secondly, we'll start looking at small parts of the compiler to break into pieces that are:
The goal is to lay the groundwork for refactoring what we now have into smaller units that at the same time both allow us to unit test the module and use it as a test case for the compiler at the same time. Each module we can compile will then both give increased confidence in the generated code and bring us one step closer to the state where the compiler can fully compile itself.
A good set of candidates to start with appears to be the basic
followed by the code to parse and print our "s-expression syntax". Will it prove
more challenging than expected? We'll see...
The remarkable thing is that "out of the box"
./compile scanner.rb actually
yields a binary, though it will not-so-remarkably run into missing "stuff" on trying to run it:
Method missing: __send__
Note that this also means that e.g. it quietly accepts that it has no
Struct. However that is actually in line with Ruby
semantics, and one of the things that makes efficiency such a problem:
ruby scanner.rb runs all the way through (with no output), as it is
legal to reference an un-defined variable - it is first if we try to
access it we are meant to get an error.
In reality, for a compiler, we'll likely want to at least warn about the most obvious of these at least (I'm sure I can fill several articles on the topic of adding warnings for "legal but stupid" behaviour in semi-sane ways, but lets wait until we can actually compile a reasonably sized program first).
In its present form,
Scanner actually presents some challenges off
the bat just from a cursory look:
Struct which we haven't done anything to implement (and
then re-opens the new class, which we don't yet support.
Tokens::Keyword. Looking up
Tokens::Keyword will likely fail
This is why self hosting is such a useful step - you're instantly faced with the practicalities of a real program written in the language. And while this program is in our control, making too drastic changes to it will defeat the purpose, so we have an incentive to fix things properly whenever reasonable.
We now have to consider whether to add support for these, whether full or temporary
hacks, or change
#respond_to?: Doing this properly involves a
Hash of the methods of a class.
We'll make a note the method hash is needed, and defer that to a future part.
Instead we'll hack up a default
#respond_to? which is rater anaemic, and
override it for
ScannerString, which is the point here anyway.
We'll ignore the inner classes, and move ScannerString and Position out of Scanner.
Struct: We'll change the code here, and define a "proper"
Struct until later.
attr_accessor: We've had our painful trip down this road before. I'm not inclined
to revisit it anytime soon. We'll do this the hacky way, and special case
attr_writer to custom generate methods for now at least (it may be the
best long term solution, though they were the "test case" for
Blocks could / should work in theory. Lets see what is broken and fix it.
Tokens::Keyword: Lets refactor this part. The code calling
Tokens::Keyword does not
belong in the
Scanner. Lets move it to
ParserBase, and thus defer this issue.
Scanner#initialize uses '#is_a?' to see if it has been given a
File or a stream.
#path are used if we're given a
File object, to extract and expand the path name.
But first of all, lets start chopping
Scanner into pieces and creating test cases
to see what works and doesn't, and then gradually flesh out a more and more complete
We will need to be able to read from somewhere for the
Scanner class to work, so lets
first add a basic test of
STDOUT.puts "Hello world from STDOUT"
This instantly fails, as we haven't even defined STDOUT. We'll remedy for now by adding
STDOUT = IO.new
lib/core/core.rb (in d24d56b)
Next we add a test of STDIN:
and modify our step definition to echo "test" into the compiled binaries. This will also fail because STDIN is missing. We make the same change as for STDOUT, but in this case it is not enough:
WARNING: __send__ bypassing vtable not yet implemented.
WARNING: Called with 0x8766bf8
WARNING: self = 0x8766b78
WARNING: (string: 'getc')
This one comes when there is no allocated vtable slot. The intent is for the class structures
to use vtable slots like in C++ for example for most methods. The heuristics used to allocate
vtable slots currently considers only methods we have seen defined, not methods that have
been otherwise mentioned. In this case
#getc has not been defined anywhere yet, and so
the compiler should have done a lookup in a hash table for this class, and then ended up
falling back on a generic
#method_missing, but the hash lookup is not yet implemented.
Anyway, lets put in place a
lib/core/io.rb. This is harder than expected, as
our s-expression language does not expose a way to get addresses to stack variables easily.
So here's a first, horribly inefficient, stab (in 54a0c13)
# FIXME: This code is specific to a 32 bit little endian
# arch, and is also horribly inefficient because we don't
# have an easy way of getting the address of a stack allocated
(assign tmp (malloc 4))
(assign (index tmp 0) 0)
(read 0 tmp 1)
(assign c (__get_fixnum (index tmp 0)))
Of course we can do a lot better by adding, say, an "(addr var)" construct. This is
still horribly bad, though - no error checking from
read; and it just reads from
STDIN regardless of desired filedescriptor for this
IO object, and it does no
buffering, and other nastiness. Plenty to deal with later...
Our first test case from the actual Scanner class will be the only tricky part in
# set filename if io is an actual file (instead of STDIN)
# otherwhise, indicate it comes from a stream
@filename = io.is_a?(File) && File.file?(io) ? File.expand_path(io.path) : "<stream>"
However this will fail:
Object#is_a? not implemented
Method missing: file?
Looking more closely into this, we'll find a few things: First I thought I'd not yet implemented
the ternary conditional, but the actual problem is that it fails to parse the ternary correctly
in the face of the logical and ('&&'). You can see this if you run the compiler like this:
ruby compiler.rb --parsetree -norequire features/inputs/scanner1.rb. This is likely "just" a priority issue.
But let's untangle this into two tests - one with the ternary if, and one without.
if io.is_a?(File) && File.file?(io)
@filename = File.expand_path(io.path)
@filename = "<stream>"
Changing the ternary to a full if will still prove interesting. Let us quickly hack in
#is_a? just for
File. First we need to add one to
lib/core/object.rb (882c71b) :
false # We're pessimists
Now for the next surprise: We don't have false, do we? We've defined false/true in
lib/core/core.rb, but there they are ordinary local variables, only accessible in the main scope. We really ought to give them special treatment as fake global variables referring to global instances of
TrueClass, but that opens another can of worms (we can no longer assume non-null means true and null means false, which will change comparisons.
For the time being we sidestep this by defining
false inObject` as a method (!) (I added that too in 882c71b) :
This gives us another surprise when trying to compile
features/inputs/scanner2.rb (04b6060) ; the one without the ternary if:
We still get "Method missing file?", which means the second part of the supposedly short-circuiting '&&' expression gets executed, but we're not passing a
File object in.
This is yet another of our earlier convenient hacks that now needs to be sorted out:
&& gets turned into
and, which in turn gets interpreted as a method call because the compiler does not have a builtin
and construct. As a result, it is evaluated after its arguments. Or rather, it would have been, had not the second argument been a sub-expression which itself fails.
As a lesson in why proper testing is essential, let us use this as an excuse to skip
File#path this time around: Instead let use make the compiler handle
&& properly, so that we never get to them. Well, as long as we're only ever trying to compile from
STDIN anyway. It's a start.
The changes to handle it are absolutely minor: We need to add ':and' to the list of keywords in
compiler.rb, and define
compile_and (in 75d1ecd62d):
+ # Shortcircuit 'left && right' is equivalent to 'if left; right; end'
+ def compile_and scope, left, right
+ compile_if(scope, left, right)
But the test case also reveals another bug: When adding the caching of 'self', I failed to account
for the special handle of 'self' in the outermost scope (it's treated as a global variable). Maybe
treating the global scope any differently is the mistake, and we should just put that too on the
stack, but for now the fix is trivial: Just add a check for
- if arg.first == :lvar || arg.first == :arg
+ if arg.first == :lvar || arg.first == :arg || (arg.first == :global && arg.last == :self)
Doing a "./compile features/inputes/shortcircuit.rb" (or running the rspec tests) now yields a /tmp/shortcircuit that actually short-circuits.
features/inputs/scanner2.rb now successfully sidesteaps the methods on
File we're being
too lazy to implement. Though
features/inputs/scanner1.rb that uses the ternary conditional
still crashes, so there's that.
This is where we'll leave it for now.
This part probably seems all over the place, but it's part of the fun once you get to a stage where it becomes viable to start looking at bootstrapping the compiler in itself.
I'll make the next few parts shorter than some of the 3000+ word monsters I've put out in the past,
to also try to focus them a bit more, but next time will focus on getting
Scanner#unget to work, and then work our way through
Scanner#ws too. If you look
at our intended milestone of getting the s-expression parser to compile, it might look like that
brings us almost there.
Not so - look forward to complication after complication being revealed as we stumble into class after class, method after method, that we need to implement at least partially before we get there. And possibly some more simplifying compiler changes to sidestep some of them.
But incidentally, each one of them will make the next parts of the compiler easier to get past too.