This is part of a series I started in March 2008 - you may want to go back and look at older parts if you're new to this series.
This part is going to be a hodge-podge, as I walk through a good chunk of the changes needed to prepare to compile the tokenization code in tokenizer.rb, and to improve what we can handle from scanner.rb.
I'm going to start by briefly talking about this commit:
This commit breaks tokenizer.rb into several separate files, and that was the starting point for this part. To identify what I had to do, I broke apart the Tokenizer module and wrote a few small scripts to see what compiled and what crashed. E.g.:
require 'selfhost/atom'
require 'selfhost/scanner'
s = Scanner.new(STDIN)
i = Tokens::Atom.new
puts i.expect(s)
(I haven't checked in the "selfhost" directory - the files there are mostly like the broken-up tokenizer.rb)
The purpose was to tweak the classes however much I needed to get them to compile, then fix or tweak until I got them mostly working, based on whatever errors I ran into.
This is, of course, all about getting one step closer to the first big "selfhosting" goal of being able to run the s-expression parser.
Rather than following the entire process, which was simple yet tedious, I'll walk through the most important commits, and simply list the others.
I'm going to skip over these, other than mentioning the commits. They are all important, but trivial. If you have any questions about them, I'm happy to answer in the comments:
Until now, if a method did not specify a return value, we'd leave garbage in %eax. Ruby does in fact require nil to be returned in most instances where there's not a sane last expression value to depend on.
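For reference, this is the Ruby behaviour we're matching - a method body without a sensible final expression still evaluates to nil:

def greet(name)
  # no explicit return value, and no sane last expression
end

p greet("world")   # => nil in Ruby; our compiled code used to leave whatever happened to be in %eax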
For a while I was baffled as to why the not-operator methods required an extra argument, and used workarounds:
Here I fixed it:
The specific line that fixes it was this, from transform.rb:
e[3] = E[e[2]] if e[2]
Previously this was done unconditionally, which meant that if e[2] was already an empty array, we wrapped it in another array and unintentionally gave a method call an extra, though invalid, argument.
The tokenizer classes rely on ranges in a number of places. This commit takes a first tiny stab at a basic version of the Range class, and also adds support for literal ranges (1..5), through this bit added to transform.rb:
+ def rewrite_range(exp)
+   exp.depth_first do |e|
+     if e[0] == :range
+       STDERR.puts e.inspect
+       e.replace(E[:callm, :Range, :new, e[1..-1]])
+     end
+     :next
+   end
+ end
+
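I won't reproduce the Range class from the commit here, but a first tiny stab needs little more than something along these lines (this is my sketch, not the committed code - the method names and details are just for illustration):

class Range
  def initialize(min, max)
    @min = min
    @max = max
  end

  def first
    @min
  end

  def last
    @max
  end

  # enough to let the tokenizer ask whether a value falls inside a literal range
  def member?(val)
    val >= @min && val <= @max
  end
end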
What this commit is talking about (unfortunately it is really messy and includes unrelated whitespace and documentation changes etc. - I was sloppy; sorry about that) is, for example, the difference between these expressions:
foo(45 + 5)
foo 45 + 5
Prior to this change, these were handled differently. The synthetic "call" operator used by the shunting yard parser needs a high priority to bind tightly to the function arguments. But in the absence of parentheses, this meant it bound too tightly to the tokens immediately following, so that the latter example above would be treated as foo(45) + 5, because the call operator was given higher precedence than +.
This was resolved by introducing a second call operator, and in the instance where we are handling functions without a leading parenthesis, that operator (with much lower priority) is pushed onto the operator stack instead:
  if possible_func
    reduce(ostack)
-   ostack << opcall
+   ostack << @opcall2
  end
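To make the difference concrete, after this change both forms group the same way:

foo(45 + 5)   # foo receives 50
foo 45 + 5    # foo also receives 50 - previously this was treated as foo(45) + 5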
As it turned out, my previous attempt at implementing || was completely broken. I don't know what I was thinking. All evidence is that I was not.
The original "||" implementation basically turned <left> || <right> into if <left>; true; else <right>; end. Which sort-of works, except it discards the result from <left>, which obviously isn't acceptable.
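It isn't acceptable because Ruby code routinely uses || for its value, not just for control flow - hypothetical examples, but the pattern is everywhere:

line = @buf || readline          # we want the buffered value itself...
name = opts[:name] || "anon"     # ...not just true or false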
The first commit above improves that ever so slightly by changing the implementation into the equivalent of this pseudo-code:
__left = <left>
if __left
  __left
else
  <right>
end
But that really was not a very satisfying solution. So the next commit changes it drastically, and also refactors #compile_if and #compile_while, and at the same time takes a stab at handling our minimal static typing in a reasonable way. Here's how #compile_or changed:
  def compile_or scope, left, right
    @e.comment("compile_or: #{left.inspect} || #{right.inspect}")
-   # FIXME: Eek. need to make sure variables are assigned for these. Turn it into a rewrite?
-   compile_eval_arg(scope,[:assign, :__left, left])
-   compile_if(scope, :__left, :__left, right)
+
+   ret = compile_eval_arg(scope,left)
+   l_or = @e.get_local + "_or"
+   compile_jmp_on_false(scope, ret, l_or)
+
+   l_end_or = @e.get_local + "_end_or"
+   @e.jmp(l_end_or)
+
+   @e.comment(".. or:")
+   @e.local(l_or)
+   or_ret = compile_eval_arg(scope,right)
+   @e.local(l_end_or)
+
+   @e.evict_all
+
+   combine_types(ret,or_ret)
  end
Which does roughly this, as pseudo-code:
ret = <left>
if !ret goto <label>_or
goto <label>_end_or
<label>_or:
<right>
<label>_end_or:
You'll note this is not optimal; we can improve on it by adding code to avoid one of the branches, but this was a decent shortcut to let it still share code with #compile_if and #compile_while.
#compile_jmp_on_false and #combine_types are worth a quick look. #compile_jmp_on_false extracts part of #compile_if as follows:
+ def compile_jmp_on_false(scope, r, target)
+   if r && r.type == :object
+     @e.save_result(r)
+     @e.cmpl(@e.result_value, "nil")
+     @e.je(target)
+     @e.cmpl(@e.result_value, "false")
+     @e.je(target)
+   else
+     @e.jmp_on_false(target, r)
+   end
+ end
As you can see, the main reason it has been given its own method is to handle the case where we check the truthiness of an object: in Ruby only nil and false are falsy, so when all we know is that we have "some object", we have to compare against both before deciding whether to jump.
#combine_types looks like this:
+ def combine_types(left, right)
+   type = nil
+   if left && (!right || left.type == right.type)
+     type = left.type
+   end
+   return Value.new([:subexpr],type)
+ end
This was also extracted from #compile_if, and is responsible for weakening the type information. Which is the fancy way of saying that if the type of the left-hand expression matches that of the right-hand expression, we return that type. Otherwise, we return nil in the type field, which in our current incarnation means "we have no clue, this could be anything", and means we won't take any object-specific action elsewhere.
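A rough illustration of the effect, assuming Value objects expose their type the way the snippets above suggest:

v_obj     = Value.new([:subexpr], :object)
v_unknown = Value.new([:subexpr], nil)

combine_types(v_obj, v_obj).type       # => :object - both sides agree, so keep the knowledge
combine_types(v_obj, v_unknown).type   # => nil - no agreement, so "this could be anything"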
Scope#add_constant and Scope#find_constant
Next up we need to handle inner classes. The first step is to actually capture Foo::Bar as something different from Foo.bar, which is how it was treated previously. We do this by introducing the internal :deref operator in the parse tree in 6ceb215, in operators.rb:
- "::" => Oper.new(100, :callm, :infix, 2,2,:left),
+ "::" => Oper.new(100, :deref, :infix, 2,2,:left),
The next step is a major change to build a sequence of class scope objects. Consider the case where I open a class and define a method that refers to an as-yet-undefined class, and then later re-open the class and define that class as an inner class:
class Foo
  def hello
    Bar.new
  end
end

class Foo
  class Foo
    class Bar
    end
  end
end
To handle this case, ClassScope objects must persist across open/close of a class, and they do. However, to compile this to static references, I also must identify any references and resolve them, to be able to distinguish a possible ::Bar from ::Foo::Bar without being forced to do runtime lookups (we still need to be able to fall back to dynamic constant lookup at some point in the future). So let's take a look at the code, in transform.rb:
+ def build_class_scopes(exps, scope = nil)
+   scope = @global_scope if !scope
+
+   return if !exps.is_a?(Array)
+
+   exps.each do |e|
+     if e.is_a?(Array)
+       if e[0] == :defm && scope.is_a?(ClassScope)
+         scope.add_vtable_entry(e[1]) # add method into vtable of class-scope to associate with class
+
+         e[3].depth_first do |exp|
+           exp.each do |n|
+             scope.add_ivar(n) if n.is_a?(Symbol) and n.to_s[0] == ?@ && n.to_s[1] != ?@
+           end
+         end
Firstly, above, we deal with method definitions, by identifying instance variables that are plainly mentioned (we still don't support dynamically generated ivars).
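For example (a hypothetical class, just to illustrate the scan), compiling this would register :@greeting via add_ivar, while the class variable is skipped by the n.to_s[1] != ?@ check:

class Foo
  def hello
    @greeting = "hello"   # plain ivar symbol in the body => add_ivar(:@greeting)
    @@count = 0           # starts with @@ => not treated as an ivar here
  end
end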
Then this bit was extracted from compile_class, as we may as well allocate these earlier:
+       elsif e[0] == :call && e[1] == :attr_accessor
+         # This is a bit presumptious, assuming noone are stupid enough to overload
+         # attr_accessor, attr_reader without making them do more or less the same thing.
+         # but the right thing to do is actually to call the method.
+         #
+         # In any case there is no actual harm in allocating the vtable
+         # entry.
+         #
+         # We may do a quick temporary hack to synthesize the methods,
+         # though, as otherwise we need to implement the full define_method
+         # etc.
+         arr = e[1].is_a?(Array) ? e[2] : [e[2]]
+         arr.each {|entry|
+           scope.add_vtable_entry(entry.to_s[1..-1].to_sym)
+         }
And then we handle the case of the class or module blocks themselves, by creating the ClassScope objects that will hold the scope information and recursing into them, and finally we recurse for all other AST node types:
+       elsif e[0] == :class || e[0] == :module
+         superclass = e[2]
+         superc = @classes[superclass.to_sym]
+         cscope = @classes[e[1].to_sym]
+         cscope = ClassScope.new(scope, e[1], @vtableoffsets, superc) if !cscope
+         @classes[cscope.name.to_sym] = cscope
+         @global_scope.add_constant(cscope.name.to_sym,cscope)
+         scope.add_constant(e[1].to_sym,cscope)
+         build_class_scopes(e[3], cscope)
+       elsif e[0] == :sexp
+       else
+         e[1..-1].each do |x|
+           build_class_scopes(x,scope)
+         end
+       end
+     end
+   end
+ end
And at the top level in transform.rb we hook this step in early, starting by initializing the GlobalScope object, so we have something to hook the class scopes into:
  def preprocess exp
+   # The global scope is needed for some rewrites
+   @global_scope = GlobalScope.new(@vtableoffsets)
+   build_class_scopes(exp)
+
There are a number of smaller supporting changes, but let's just look at compile_deref, which handles the new internal :deref node, as a starting point:
+ def compile_deref(scope, left, right)
+   cscope = scope.find_constant(left)
+   raise "Unable to resolve: #{left}::#{right} statically (FIXME)" if !cscope || !cscope.is_a?(ClassScope)
+   get_arg(cscope,right)
+ end
This simply delegates to the new find_constant methods in the scope classes, which for ClassScope recurse and build a multi-level class name (e.g. Foo__Bar) - in classcope.rb:
+ def find_constant(c)
+   const = @constants[c]
+   return const if const
+   return @next.find_constant(@name+"__"+c.to_s) if @next
+   return nil
+ end
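To make the mangling concrete, here's roughly how it plays out for a simple nested definition (a simplified sketch of the call sequence, not actual output):

class Foo
  class Bar
  end
end

# build_class_scopes registers the inner class both as :Bar inside Foo's
# ClassScope and under the mangled name :Foo__Bar in the global scope, so
# compiling Foo::Bar boils down to:
#
#   cscope = scope.find_constant(:Foo)   # Foo's ClassScope
#   get_arg(cscope, :Bar)                # resolved against the class known as Foo__Bar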
There's not much to say about the next change: it simply adds a small bit to treeoutput.rb that, on coming across [:incr, :foo], instead translates it into [:callm, :foo, :"+", <right hand expression>].
This implements a very basic #is_a? that simply loops up the superclass chain.
It required the implementation itself, but also adding a basic Object#== that does pointer-based comparison, and Class#superclass itself.
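I won't paste the commit, but the idea is simple enough that a sketch captures it (this is my reconstruction, not the committed code):

class Object
  # relies on Object#== doing a plain pointer comparison,
  # and on Class#superclass being available
  def is_a?(c)
    k = self.class
    while k
      return true if k == c
      k = k.superclass
    end
    false
  end
end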
There's also a one-liner that helps debugging by ensuring the position doesn't get accidentally set to nil.
A while back I was holding back on this, awaiting a full define_method implementation so I could do it in pure Ruby, but since I pulled back on that, it's time for a simple implementation, as it helps get closer to compiling the unmodified scanner. The bulk of it is this bit in transform.rb, which simply injects one or two method definitions:
+
+   # Then let's do the quick hack:
+   #
+
+   type = e[1]
+   syms = e[2]
+
+   e[0] = :do
+   e.slice!(1..-1)
+   syms.each do |mname|
+     mname = mname.to_s[1..-1].to_sym
+     if (type == :attr_reader || type == :attr_accessor)
+       e << E[:defm, mname, [], ["@#{mname}".to_sym]]
+     end
+     if (type == :attr_writer || type == :attr_accessor)
+       e << E[:defm, "#{mname}=".to_sym, [:value], [[:assign, "@#{mname}".to_sym, :value]]]
+     end
+   end
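So, for example, attr_accessor :foo gets rewritten into roughly the equivalent of:

def foo
  @foo
end

def foo=(value)
  @foo = value
end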
I won't comment further on these:
My current plans (subject to change):
#send and the accompanying method lookup machinery (this also affects the main Compiler class). There's also the issue of exceptions, though I'm unsure if I want to deal with that yet, as well as #include'ing modules. These may be hopelessly optimistic:
eval().