This is part of a series I started in March 2008 - you may want to go back and look at older parts if you're new to this series.
As mentioned last time, this time around I've first landed a number of changes that make the parser able to parse the entire compiler (whether it parses it correctly remains to be seen, and is outside the scope of this part).
You can find these merged in 37d6a0510 and 44914839a72.
Secondly, we'll start looking at small parts of the compiler to break into pieces that are:
The goal is to lay the groundwork for refactoring what we now have into smaller units that at the same time both allow us to unit test the module and use it as a test case for the compiler at the same time. Each module we can compile will then both give increased confidence in the generated code and bring us one step closer to the state where the compiler can fully compile itself.
A good set of candidates to start with appears to be the basic Scanner class, followed by the code to parse and print our "s-expression syntax". Will it prove more challenging than expected? We'll see...
The remarkable thing is that "out of the box" ./compile scanner.rb actually yields a binary, though it will not-so-remarkably run into missing "stuff" when we try to run it:
$ /tmp/scanner
attr_reader col
define_method col
Method missing: __send__
Note that this also means that it quietly accepts, for example, that it has no definition for Struct. However, that is actually in line with Ruby semantics, and one of the things that makes efficiency such a problem: ruby scanner.rb runs all the way through (with no output), as it is legal to reference an undefined variable; only when we try to access it are we meant to get an error.
In reality, for a compiler, we'll likely want to at least warn about the most obvious of these (I'm sure I can fill several articles on the topic of adding warnings for "legal but stupid" behaviour in semi-sane ways, but let's wait until we can actually compile a reasonably sized program first).
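To make the point about Ruby semantics concrete, here is a small sketch in plain Ruby (run under a regular interpreter, not our compiler). The names make_position, try_make_position and UndefinedPositionStruct are made up for illustration, and the constant is deliberately never defined:

```ruby
# Defining a method whose body mentions an unknown constant is legal:
# Ruby does not resolve the name until the line actually runs.
def make_position
  UndefinedPositionStruct.new(1, 2) # hypothetical constant, never defined
end

# Calling the method is what finally triggers the lookup, and the error.
def try_make_position
  make_position
  :no_error
rescue NameError => e
  e.class
end

p try_make_position
```

A compiler that wants to warn about this has to do so without rejecting the program, since the name could still be defined before the call happens.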
In its present form, Scanner actually presents some challenges right off the bat, just from a cursory look:
Struct, which we haven't done anything to implement (and then re-opens the new class, which we don't yet support).
attr_accessor
Tokens::Keyword. Looking up Tokens::Keyword will likely fail.
This is why self hosting is such a useful step - you're instantly faced with the practicalities of a real program written in the language. And while this program is in our control, making too drastic changes to it will defeat the purpose, so we have an incentive to fix things properly whenever reasonable.
We now have to consider whether to add support for these, either fully or as temporary hacks, or change Scanner accordingly.
#respond_to?: Doing this properly involves a Hash of the methods of a class. We'll make a note that the method hash is needed, and defer that to a future part. Instead we'll hack up a default #respond_to? which is rather anaemic, and override it for ScannerString, which is the point here anyway.
We'll ignore the inner classes, and move ScannerString and Position out of Scanner.
Struct: We'll change the code here, and define a "proper" Position class and defer Struct until later.
attr_accessor: We've had our painful trip down this road before. I'm not inclined to revisit it anytime soon. We'll do this the hacky way, and special case attr_accessor / attr_reader / attr_writer to custom generate methods, for now at least (it may be the best long term solution, though they were the "test case" for define_method previously).
Blocks could / should work in theory. Let's see what is broken and fix it.
Tokens::Keyword: Let's refactor this part. The code calling Tokens::Keyword does not belong in the Scanner. Let's move it to ParserBase, and thus defer this issue.
#is_a?: Scanner#initialize uses #is_a? to see if it has been given a File or a stream.
File.file?, File.expand_path and #path are used if we're given a File object, to extract and expand the path name.
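As a rough sketch of the Struct and attr_accessor decisions above: a Position written as a plain class, with the accessors spelled out by hand, which is roughly the kind of method that special-casing attr_accessor would have the compiler generate. The field names filename, lineno and col are assumptions for illustration (only col shows up in the output earlier), not necessarily what the real Scanner uses:

```ruby
# Struct-based version, which needs Struct support we don't have yet:
#   Position = Struct.new(:filename, :lineno, :col)

# Plain-class version with no Struct and no attr_accessor.
class Position
  def initialize(filename, lineno, col)
    @filename = filename
    @lineno = lineno
    @col = col
  end

  # What "attr_accessor :col" would expand to: a reader and a writer.
  def col
    @col
  end

  def col=(value)
    @col = value
  end

  # Readers only, i.e. the equivalent of "attr_reader :filename, :lineno".
  def filename
    @filename
  end

  def lineno
    @lineno
  end
end

pos = Position.new("<stream>", 1, 0)
pos.col = 5
puts pos.col
```

The upside of writing it this way is that it only relies on features the compiler already handles: classes, instance variables and ordinary method definitions.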
But first of all, let's start chopping Scanner into pieces and creating test cases to see what works and what doesn't, and then gradually flesh out a more and more complete equivalent.
We will need to be able to read from somewhere for the Scanner class to work, so let's first add a basic test of STDIN and STDOUT:
STDOUT.puts "Hello world from STDOUT"
This instantly fails, as we haven't even defined STDOUT. We'll remedy that for now by adding STDOUT = IO.new to lib/core/core.rb (in d24d56b).
Next we add a test of STDIN:
puts STDIN.getc
and modify our step definition to echo "test" into the compiled binaries. This will also fail because STDIN is missing. We make the same change as for STDOUT, but in this case it is not enough:
WARNING: __send__ bypassing vtable not yet implemented.
WARNING: Called with 0x8766bf8
WARNING: self = 0x8766b78
WARNING: (string: 'getc')
This one comes when there is no allocated vtable slot. The intent is for the class structures to use vtable slots for most methods, as in C++ for example. The heuristics used to allocate vtable slots currently consider only methods we have seen defined, not methods that have been otherwise mentioned. In this case #getc has not been defined anywhere yet, and so the compiler should have done a lookup in a hash table for this class, and then ended up falling back on a generic #method_missing, but the hash lookup is not yet implemented.
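The intended lookup order can be modelled in a few lines of plain Ruby. This is only a toy sketch of the idea described above, not the compiler's actual data structures; ToyClass, add_by_name and dispatch are hypothetical names:

```ruby
# Toy model of method dispatch with vtable slots plus a by-name fallback.
class ToyClass
  def initialize(defined_methods)
    @slots = {}    # method name => slot number, fixed at "compile time"
    @vtable = []   # slot number => implementation
    @by_name = {}  # methods that never got a slot, looked up by name
    defined_methods.each do |name, impl|
      @slots[name] = @vtable.size
      @vtable << impl
    end
  end

  # Register a method that the slot allocator never saw defined.
  def add_by_name(name, impl)
    @by_name[name] = impl
  end

  def dispatch(receiver, name, *args)
    if (slot = @slots[name])
      @vtable[slot].call(receiver, *args) # fast path: indexed call
    elsif (impl = @by_name[name])
      impl.call(receiver, *args)          # slow path: hash lookup
    else
      "Method missing: #{name}"           # stand-in for #method_missing
    end
  end
end

io_class = ToyClass.new({ puts: ->(_io, str) { "wrote #{str}" } })
puts io_class.dispatch(:stdin, :puts, "hi")
puts io_class.dispatch(:stdin, :getc)
io_class.add_by_name(:getc, ->(_io) { 116 })
puts io_class.dispatch(:stdin, :getc)
```

In this model the warning above corresponds to the middle branch being absent: a method without a slot has nowhere to go except the missing-method fallback.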
Anyway, let's put in place a #getc in lib/core/io.rb. This is harder than expected, as our s-expression language does not expose a way to get the address of a stack variable easily. So here's a first, horribly inefficient, stab (in 54a0c13):
def getc
  # FIXME: This code is specific to a 32 bit little endian
  # arch, and is also horribly inefficient because we don't
  # have an easy way of getting the address of a stack allocated
  # variable.
  %s(do
    (assign tmp (malloc 4))
    (assign (index tmp 0) 0)
    (read 0 tmp 1)
    (assign c (__get_fixnum (index tmp 0)))
    (free tmp)
  )
  c
end
Of course we can do a lot better by adding, say, an "(addr var)" construct. This is still horribly bad, though: there is no error checking of the result from read, it just reads from STDIN regardless of the desired file descriptor for this IO object, and it does no buffering, among other nastiness. Plenty to deal with later...
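For comparison, this is roughly the contract a less naive version would want, written in plain Ruby against the host interpreter's IO rather than in our s-expression language. It is a sketch of the behaviour only (read from the object's own descriptor, report end of file), and read_one_byte is a made-up helper name. It is still unbuffered:

```ruby
# Read a single byte from the given IO object (not always STDIN).
# Returns the byte as an Integer, or nil at end of file.
def read_one_byte(io)
  io.sysread(1).getbyte(0)
rescue EOFError
  nil
end

reader, writer = IO.pipe
writer.write("t")
writer.close
p read_one_byte(reader)
p read_one_byte(reader)
reader.close
```

Returning an Integer mirrors the fixnum the snippet above produces via __get_fixnum.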
Our first test case from the actual Scanner class will be the only tricky part in #initialize:
class Scanner
  def initialize(io)
    # set filename if io is an actual file (instead of STDIN)
    # otherwise, indicate it comes from a stream
    @filename = io.is_a?(File) && File.file?(io) ? File.expand_path(io.path) : "<stream>"
    puts @filename
  end
end
Scanner.new(STDIN)
However this will fail:
Object#is_a? not implemented
Method missing: file?
Looking more closely into this, we'll find a few things: At first I thought I'd not yet implemented the ternary conditional, but the actual problem is that it fails to parse the ternary correctly in the face of the logical and ('&&'). You can see this if you run the compiler like this: ruby compiler.rb --parsetree -norequire features/inputs/scanner1.rb. This is likely "just" an operator priority (precedence) issue.
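For reference, this is how standard Ruby groups such an expression, shown with stand-in boolean arguments instead of the File calls. The third method is one plausible mis-parse, included purely as an illustration; I have not checked that it is the exact tree our parser produces. All three method names are made up:

```ruby
# How Ruby itself reads the expression: && binds tighter than ?:
def filename_for(is_file, exists)
  is_file && exists ? "/some/path" : "<stream>"
end

# The same thing with the grouping spelled out.
def filename_grouped(is_file, exists)
  (is_file && exists) ? "/some/path" : "<stream>"
end

# One plausible mis-parse: the ternary swallowed into the right-hand side.
def filename_misparsed(is_file, exists)
  is_file && (exists ? "/some/path" : "<stream>")
end

p filename_for(false, true)
p filename_grouped(false, true)
p filename_misparsed(false, true)
```

With a non-file input the first two give "<stream>", while the mis-grouped version gives false, so @filename would end up with the wrong value even once everything else works.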
But let's untangle this into two tests - one with the ternary if, and one without.
if io.is_a?(File) && File.file?(io)
  @filename = File.expand_path(io.path)
else
  @filename = "<stream>"
end
Changing the ternary to a full if will still prove interesting. Let us quickly hack in a #is_a? just for File. First we need to add one to lib/core/object.rb (882c71b):
def is_a?(c)
  false # We're pessimists
end
Now for the next surprise: We don't have false, do we? We've defined false/true in lib/core/core.rb, but there they are ordinary local variables, only accessible in the main scope. We really ought to give them special treatment as fake global variables referring to global instances of FalseClass and TrueClass, but that opens another can of worms (we can no longer assume non-null means true and null means false, which will change comparisons).
For the time being we sidestep this by defining false in Object as a method (!) (I added that too in 882c71b):
def false
  %s(sexp 0)
end
This gives us another surprise when trying to compile features/inputs/scanner2.rb (04b6060), the one without the ternary if:
We still get "Method missing: file?", which means the second part of the supposedly short-circuiting '&&' expression gets executed, even though we're not passing a File object in.
This is yet another of our earlier convenient hacks that now needs to be sorted out: && gets turned into and, which in turn gets interpreted as a method call because the compiler does not have a builtin and construct. As a result, it is evaluated after its arguments. Or rather, it would have been, had the second argument not been a sub-expression which itself fails.
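The difference is easy to reproduce in plain Ruby by comparing the real && with a method that merely computes the same result. The helper names traced, and_as_method, eager_trace and lazy_trace are made up for this sketch:

```ruby
# Record that an operand was evaluated, then return its value.
def traced(log, name, value)
  log << name
  value
end

# "and" as an ordinary method: both arguments are evaluated before the call.
def and_as_method(left, right)
  left && right
end

def eager_trace
  log = []
  _result = and_as_method(traced(log, :left, false), traced(log, :right, true))
  log
end

# The real operator: the right operand is skipped when the left is falsy.
def lazy_trace
  log = []
  _result = traced(log, :left, false) && traced(log, :right, true)
  log
end

p eager_trace
p lazy_trace
```

In the method-call version the right operand always runs, which is exactly how File.file? ends up being called on something that is not a File.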
As a lesson in why proper testing is essential, let us use this as an excuse to skip File.file?, File.expand_path and File#path this time around: Instead let us make the compiler handle && properly, so that we never get to them. Well, as long as we're only ever trying to compile from STDIN anyway. It's a start.
The changes to handle it are absolutely minor: We need to add ':and' to the list of keywords in compiler.rb, and define compile_and (in 75d1ecd62d):
+ # Shortcircuit 'left && right' is equivalent to 'if left; right; end'
+ def compile_and scope, left, right
+ compile_if(scope, left, right)
+ end
+
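A quick plain-Ruby check of the equivalence claimed in that comment, with the right-hand side passed as a lambda so we can see whether it runs. The names with_and and with_if are hypothetical. One caveat for full Ruby: when the left side is false, && yields false while an if without else yields nil. Both are falsy, so conditions behave the same, and at this stage, where false is still just a null value, the distinction presumably does not come up:

```ruby
# left && right, with the right side deferred so evaluation is observable.
def with_and(left, right)
  left && right.call
end

# The lowering used by compile_and: if left; right; end
def with_if(left, right)
  if left
    right.call
  end
end

calls = 0
right = -> { calls += 1; :right_value }

p with_and(true, right)
p with_if(true, right)
p with_and(false, right)
p with_if(false, right)
p calls
```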
But the test case also reveals another bug: When adding the caching of 'self', I failed to account for the special handling of 'self' in the outermost scope (it's treated as a global variable). Maybe treating the global scope any differently is the mistake, and we should just put that on the stack too, but for now the fix is trivial: Just add a check for [:global,:self] in get_arg:
- if arg.first == :lvar || arg.first == :arg
+ if arg.first == :lvar || arg.first == :arg || (arg.first == :global && arg.last == :self)
Doing a "./compile features/inputs/shortcircuit.rb" (or running the rspec tests) now yields a /tmp/shortcircuit that actually short-circuits.
And features/inputs/scanner2.rb now successfully sidesteps the methods on File we're being too lazy to implement. Though features/inputs/scanner1.rb, which uses the ternary conditional, still crashes, so there's that.
This is where we'll leave it for now.
This part probably seems all over the place, but it's part of the fun once you get to a stage where it becomes viable to start looking at bootstrapping the compiler in itself.
I'll make the next few parts shorter than some of the 3000+ word monsters I've put out in the past, and try to focus them a bit more. Next time we will focus on getting Scanner#get and Scanner#unget to work, and then work our way through Scanner#expect and Scanner#ws too. If you look at our intended milestone of getting the s-expression parser to compile, it might look like that brings us almost there.
Not so - look forward to complication after complication being revealed as we stumble into class after class, method after method, that we need to implement at least partially before we get there. And possibly some more simplifying compiler changes to sidestep some of them.
But incidentally, each one of them will make the next parts of the compiler easier to get past too.