This is part of a series I started in March 2008 - you may want to go back and look at older parts if you're new to this series.
One of the great "joys" of trying to compile Ruby is of course the level of dynamism. Especially when trying, as I am, to make the compiler as static as possible.
One of the culprits is the ability to send messages to objects that may be composed dynamically (e.g. formatting strings), and that may or may not even exist at compile time.
So far we've avoided this. There are two issues here:
We need to be able to handle methods with names that are not present in the application at compile time. Currently we put everything into vtables, and to know how to allocate slots we need to know the names at compile time.
We need to be able to look up methods by name, and do something with the information.
Matz' Ruby implementation does this with a hash table of methods. And we will too. This does, however, not mean we'll ditch the vtables. The vtables are essential to get fast method calls for methods we call by names statically known, which for most applications will likely be the vast majority.
Even for method calls where the target method was not defined at compile time this works, by pre-allocating a vtable slot.
Instead what we will do is add redundant information for the methods known at compile time, and augment this with the ability to add new methods down the line.
The initial motivation for this is the work towards self-hosting, and the
compiler itself, foolishly or not, uses #send
to call the #parse_XXX
methods in Parser
for example.
The last time I wrote a hash table implementation from scratch was more than 20 years ago, in my introductory data structures and algorithms course. I read the book on a long train journey and mostly ignored the lessons. Apart from frustrating everyone by insisting on doing a N-Queens exercise in Amiga-E (a wonderful language by the amazing Wouter van Oortmerssen, though unfortunately rather esoteric even back in '94)
Thankfully, making a simple hash table is, well, simple.
We'll do a very basic linear scan hash table. If you've forgotten your CS courses and is too lazy to click the wikipedia link, that means that once we've calculated which slot an entry should ideally go in, if it is occupied, we linearly probe the other slots until we find an empty one.
There are all kinds of upsides and downsides from various methods for scanning/probing, and from using scans instad of alternative structures, but the point as usual is simplicity first. Especially in this case as we won't even know the access patterns until things are complete enough to do proper benchmarks.
Firstly, let us implement #hash
. The documentation for Object#hash
says this:
Generates a Fixnum hash value for this object. This function must have the property that a.eql?(b) implies a.hash == b.hash.
The hash value is used along with eql? by the Hash class to determine if two objects reference the same hash key. Any hash value that exceeds the capacity of a Fixnum will be truncated before being used.
The hash value for an object may not be identical across invocations or implementations of Ruby. If you need a stable identifier across Ruby invocations and implementations you will need to generate one with a custom method.
This is thankfully easy to match, and we'll do it with the DJB hash function used in amongst others CDB. It's simple, and reasonably efficient:
+ # DJB hash
+ def hash
+ h = 5381
+ each_byte do |c|
+ h = h * 33 + c
+ end
+ h
+ end
You'll note we use a multiply by 33 instead of the ((h << 5) + h)
in the description. The
real reason for this is that I haven't implemented <<
yet. But in our case, given Ruby's
pathological amount of method calls, h * 33
is actually even likely to be faster than ((h << 5) + h)
until we actually add some fairly advanced optimization (so ca. 2030 at my current pace).
Note that our function is not entirely equivalent to the original djb hash function, as the original assumes truncation to a 32 bit value. It has little practical implication here given that we will be using a modulo of it in the actual code.
We'll also add String#eql?
, which in this case is just an alias for #==
:
+ def eql? other
+ self.== other
+ end
(In 6b01ee3)
Since I happen to know we'll want a working modulus operator (because we'll want to take the hash value of a string, and take the modulus of the hash value and the capacity of the Array we'll store the hash table in to find out where to start probing), that's what we'll fix next.
We start out gently in 3f24489 by adding it to the operator table in operators.rb
, and
in e6bb8c7 we add the mod
keyword to the s-expression syntax to give us something to
implement it with.
In 88066af we implement the mod
keyword:
--- a/compile_arithmetic.rb
+++ b/compile_arithmetic.rb
@@ -46,8 +46,16 @@ class Compiler
@e.movl(@e.result, dividend)
@e.sarl(31, dividend)
@e.idivl(divisor)
+
+ yield if block_given?
end
end
Value.new([:subexpr])
end
+
+ def compile_mod(scope, left, right)
+ compile_div(scope,left,right) do
+ @e.movl(:edx, @e.result)
+ end
+ end
end
Piece of cake as div
already does the calculation, we just need to actually
save the modulus that's already calculated.
%
is one of the most heinous abominations of Ruby syntax. I'm not talking
about the modulus operator. I'm talking about the other %
. The evil twin.
You have met it. It's the %
that starts quoted strings og all kinds. We've
"stolen" one of the codes for the s-expression syntax.
Only %
is really horrendously ambiguous. For example, consider what this
expression should parse as:
% x
In a sane, just world, this would be a syntax error.
Nah. Too obvious, isn't it? Can't be anything but "x"
. Uh? What?!?
As it happens, in contexts where it can't validly be the infix modulus opeator,
%
starts a quoted string even when the following character is a space.
For characters outside of a specific set of exceptions with special meaning,
%
will treat the character immediately following as the "quote" character
that will start and terminate the given string. So above, we're basically using
space as the equivalent of a quote.
(If you have ever seen that used other than as a way of terrorising language
implementors or other unsuspecting developers victims, I want to know)
Of course this broke the parser, as we've consistently gone for simplicity first...
First of all, let's get some test cases in place (in 8473eaecb):
@mod
Scenario Outline: Simple expressions
Given the expression <expr>
When I parse it with the full parser
Then the parse tree should become <tree>
Examples:
| expr | tree | notes |
| "% x " | [:do,"x"] | |
| "a + % x " | [:do,[:+,:a,"x"]] | |
| "1 % 2 " | [:do,[:%,1,2]] | |
| "1 % 2" | [:do,[:%,1,2]] | |
Then we fix it. As it happens the fix is simple. As horrendous as this syntax looks, at least the cases we're concerned with now are just a matter of distinguishing between a prefix and infix operator, and the shunting yard parser has a mechanism for that. First we change the operator table (in c276794):
--- a/operators.rb
+++ b/operators.rb
@@ -97,7 +97,10 @@ Operators = {
:prefix => Oper.new( 20, :-, :prefix)
},
"!" => Oper.new( 10, :"!", :prefix),
- "%" => Oper.new( 20, :"%", :infix),
+ "%" => {
+ :infix_or_postfix => Oper.new( 20, :"%", :infix),
+ :prefix => Oper.new( 20, :quoted_exp, :prefix)
+ },
"/" => Oper.new( 20, :/, :infix),
"*" => {
:infix_or_postfix => Oper.new( 20, :"*", :infix),
This basically makes the parser accept both versions in their respective contexts.
Then we need to handle it. As it happens we can use the opportunity
for some minor cleanup. We already added code to special case the
operator case of '%', and that's actually what broke the modulo
operator case. So now we rip that out, and add a helper #get_quoted_exp
to let the parser choose whether to call back in and parse a quoted
expression (and make use of that helper one other place as well) (in 3815fb2):
--- a/tokens.rb
+++ b/tokens.rb
@@ -143,10 +143,15 @@ module Tokens
@s.unget(s)
end
+ def get_quoted_exp(unget = false)
+ @s.unget("%") if unget
+ @s.expect(Quoted) { @parser.parse_defexp } #, nil]
+ end
+
def get_raw
case @s.peek
when ?",?'
- return [@s.expect(Quoted) { @parser.parse_defexp }, nil]
+ return [get_quoted_exp, nil]
when DIGITS
return [@s.expect(Number), nil]
when ALPHA, ?@, ?$, ?:, ?_
@@ -182,11 +187,6 @@ module Tokens
else
# Special cases - two/three character operators, and character constants
- if @s.peek == ?%
- r = @s.expect(Quoted) { @parser.parse_defexp }
- return [r,nil] if r
- end
-
Note that in the operator case, the rest of that code have already 'eaten' the '%' so we need an option to unget it.
We also expose the #get_quoted_exp
helper in tokenizeradapter.rb
.
And finally in shunting.rb
we trigger on quoted_exp
and call back into the
tokenizer via the adapter (in 3815fb2):
+ def parse_quoted_exp
+ @tokenizer.get_quoted_exp
+ end
+
def shunt(src, ostack = [], inhibit = [])
possible_func = false # was the last token a possible function name?
opstate = :prefix # IF we get a single arity operator right now, it is a prefix operator
@@ -90,6 +94,8 @@ module OpPrec
else
raise "Block not allowed here"
end
+ elsif op.sym == :quoted_exp
+ @out.value(parse_quoted_exp)
else
if op.type == :rp
@out.value(nil) if lastlp
Finally we can implement the '%' operator in Fixnum (in 9f6f151):
--- a/lib/core/fixnum.rb
+++ b/lib/core/fixnum.rb
@@ -5,6 +5,16 @@ class Fixnum < Integer
%s(assign @value 0)
end
+ def % other
+# %s(printf "%i\n" (callm other __get_raw))
+ %s(assign r (callm other __get_raw))
+ %s(assign m (mod @value r))
+
+ %s(if (eq (gt m 0) (lt r 0))
+ (assign m (add m r)))
+ %s(__get_fixnum m)
+ end
+
Note the last if
expression, which is used because the x86 instruction set and Ruby disagrees on how to
handle modulo operations with negative numbers, and I opted to stick with code similar to what gcc outputs
for C rather than encode the Ruby behaviour in the s-exp syntax. I may yet change my opinion on that - we'll
see, but for now this works.
In addition to the above, in 928fe97 I add support for Array.new(capacity)
and Array#capacity
, as well
as make Array#[]=
work for the basic cases, as this will be needed for the Hash
implementation below.
In 913a3de you will find an initial implementation of Hash
. Notably, this implementation is pure Ruby,
and (basic, simple) programs will not even necessarily break if you load this implementation in MRI and
proceed to use it. It also maintains insertion order, like MRI 1.9+, which aids in testing vs. MRI.
It does, however, lack most methods, but it should be sufficient for us for the next steps. It is worth noting,
though, that it is certainly still inefficient in eg. the interaction with Array
, and there's plenty of
scope for improvement in the future.
I'll leave figuring the hash implementation out as an exercise for the reader (or ask in the comments), as there's nothing specific to the compiler in that one and it should be quite simple: Basically I maintan an array of slots consisting of pointers - one to the key, one to the value, and a linked list to maintain insertion order. Then on setting and reading the hash table, we calculate the hash of the key, look up the first applicable slot, then we loop over the hash table until we've found a matching key, possibly wrapping around, or we find an empty slot.
If the table is too small, we re-allocate the array and re-insert all the key/value pairs.
We could do this more efficiently for special cases like the vtable where we don't need the insertion order, for example, but as usual, efficiency for later.
But in the interest of (some) efficiency as in this case it also brings us closer to feature-completeness, I will first make one further improvement to Symbol handling.
Since we now have a Hash table, we can store a hash mapping Symbol names to Symbol instances, and
thus gain the two main advantages Symbol
is meant to have: Ability to treat object identity as equality
(in other words: compare pointers rather than string values in our case), and secondly reusing the same
object instance each time the symbol is mentioned.
We'll use this to make the cost of the class method tables less excessive.
Firstly, to make Symbol
work with these changes, it was necessary to move it later in the bootstrap of the
runtime (in fc674f3). This is one of the things I find most interesting with working on this compiler:
Slowly teasing out a minimal core of Ruby and figuring out how to bootstrap just the lower level facilities
and bootstrap. That's going to be an interesting subject in its own right at some point.
Secondly we implement Object#object_id
as simply returning %s(__get_fixnum self)
. So (and this should not
be relied on), #object_id
in our case is actually the fixnum version of the pointer to the object.
We strictly don't need this, but it's used by my test case, and also lets me isolate the s-expression parts
further, implementing Object identity comparisons (note: these actually belongs in Comparable, but we don't
do include
and friends yet). (In 38ca5270e3)
These leave us with a very small manageable set of changes to Symbol
. Small enough to just include the
whole rudimentary Symbol
class here (changes can be seen in 9abb565, and followup in 077dcf3):
class Symbol
@symbols = {}
The above is our symbol-name to Symbol
hash table.
Cleaned it up to use proper String objects for the name
(though #to_s
should probably use #dup
, which I don't
think we've implemented yet)
# FIXME: Should be private, but we don't support that yet
def initialize(name)
@name = name
end
def to_s
@name
end
def hash
to_s.hash
end
# FIXME
# The compiler should turn ":foo" into Symbol.__get_symbol("foo").
# Alternatively, the compiler can do this _once_ at the start for
# any symbol encountered in the source text, and store the result.
def self.__get_symbol(name)
sym = @symbols[name]
if sym.nil? ## FIXME: Doing !sym instead fails w/method missing
sym = Symbol.new(name)
@symbols[name] = sym
end
sym
end
end
%s(defun __get_symbol (name) (callm Symbol __get_symbol ((__get_string name))))
Here is the fun: Look up the symbol; if not present in the hash table, create a new one, and return it.
One notable thing: MRI recently added garbage collection of symbols. The above actively works counter to garbage collection of symbols - once we get to add a GC we'll want to deal with that by letting the GC handle certain structures, such as this hash, as "weak" references that can be broken. It's an extra complexity, but far away.
Eventually we are going to use two separate types of hash tables:
Symbol
=> offset in vtable for statically allocated methods, and
Symbol
=> address of method for runtime defined methods. This will
keep the class specific hash tables small at the cost of one extra hash
table lookup in #send
, but #send
will always be slow compared to the
direct vtable lookup anyway.
This time we will only actually implement the former, and only handle
cases of "#send" called with method names we've statically seen. This
actually covers a substantial proportion of the use of #send
. The big
exception is cases where a method is both dynamically defined and
dynamically used by string composition or by requiring files at load
time (which we don't support yet). Most importantly at this stage it
handles the specific case of bootstrapping the compiler itself.
We'll declare it out of bounds for the compiler itself to use #send
before a suitable point in the bootstrap process, where we call a compiler
generated function to bootstrap this hash table.
Calling #send
before that will trigger a runtime error.
In 5a269ec I add a new file lib/core/class_ext.rb
that implements
the send "send" logic, and the code to bootstrap these values. This goes
in a separate file for the simple reason that it needs to be loaded after
Symbol
and various other things have gotten bootstrapped, but the basics
of Class
are needed much earlier.
First of all, this is how the send happens:
# This is called by __send__
def __send_for_obj__ obj,sym,*args
voff = Class.method_to_voff[sym]
if !voff
%s(printf "WARNING: __send__ bypassing vtable (name not statically known at compile time) not yet implemented.\n")
%s(if sym (printf "WARNING: Method: '%s'\n" (callm (callm sym to_s) __get_raw)))
%s(printf "WARNING: symbol address = %p\n" sym)
%s(printf "WARNING: object = %p\n" obj)
%s(printf "WARNING: class '%s'\n" (callm (callm self name) __get_raw))
nil
else
# We can't inline this in the call, as our updated callm
# doesn't allow method/function calls in the method slot
# for simplicity, for now anyway.
%s(assign raw (callm voff __get_raw))
%s(callm obj (index self raw) ((splat args)))
end
end
There's not that much new or different here. Mainly we look up the method
name in the new hash table (which should not be public, but we'll worry
about that later). Secondly, if the vtable offset is known - which will
be the case for all methods we've seen statically mentioned - we attempt
to convert the stored Fixnum
to a raw integer, retrieve the address from
the vtable, and do a callm
with the address instead of a method name.
This latter part is new, and we'll have a look at how to handle that shortly,
but first the last part of lib/core/class_ext.rb
:
#
# Populate the Class.method_to_voff hash based on the __vtable_names
# table that gets generated by the compiler.
#
%s(let (i max)
(assign i 0)
(assign max __vtable_size)
(assign h (callm Class method_to_voff))
(while (lt i max)
(do
(if (ne (index __vtable_names i) 0)
(callm h []= ((__get_symbol (index __vtable_names i)) (__get_fixnum i)))
)
(assign i (add i 1))
)
)
)
This is fairly straight forward: We iterate over a table name __vtable_names
that the compiler will henceforth output, and for __vtable_size
entries we
obtain a Symbol
object, and then add a mapping from the ymbol to the according
offset to the Class.method_to_voff
hash table.
This table was actually added in e6bb8c75 - I think that method in Compiler
should be reasonable self-explanatory - basically outputs an array of pointers to
constant strings.
Previously %s(callm ...)
have only supported names. Now it needs to supports
%s(callm object <some expression resulting in method address> ...)
too.
We handle this as follows. Firstly, we (in 46bb9d0) we check if the method
is provided by name (as a Symbol
), and only use the old logic if it is:
+ if method.is_a?(Symbol)
+ off = @vtableoffsets.get_offset(method)
+ if !off
Then further down, we conditionally compile the actual call differently:
- @e.callm(m)
+ if off
+ @e.callm(m)
+ else
+ # NOTE: The expression in "method" can not
+ # include a function call, as it'll clobber
+ # %ebx
+ @e.call(compile_eval_arg(scope,method))
+ end
I'm not quite pleased with the complexity of compile_call.rb
, but it'll do for now.
And that's in. Doing #send
now works. And we have a somewhat working Hash
.
Not bad.
And with this I will start doing what I've mentioned for a while now and switch to writing shorter, more focused entries on specific features rather than these huge parts. And feature branches instead of branches focused on a specific part....
Hopefully it will get things moving a bit quicker.
(This also signals that I'll be more open to outside contributions and suggestions, though I recognise this probably isn't the easiest codebase to jump into, despite the copious amount of writing I've done about it's internals)