Linking Outside of the Box: Optimizing Ruby with C/C++

Prototyping

My first attempt at a routine to XOR blocks (stored as Strings) looked something like this:

class String
  def xor!( str1, str2 )
    BLOCK_SIZE = 128.kilobytes
    (0...BLOCK_SIZE).each { |i| self[i] ^= str1[i] ^ str2[i] }
  end
end

There may very well be a faster way of doing this in pure Ruby, but I couldn’t find it (and didn’t want to waste the time). This worked well enough to finish the basic implementation and unit tests. And even though this prototype version was way too slow, it allowed me to build-out the higher-level code and tests so that went I went to replace it with the faster version, I had already established extensive code coverage (which revealed several bugs in my optimized implementation). Once the prototype is complete it’s time for …

Profiling

Premature optimization is the root of all evil. – Donald Knuth

Even though I knew that XOR was going to be the biggest cycle-sink, I decided that now would be a good time to learn about Ruby profiling. ruby-prof makes this fairly easy. I chose my biggest, most involved unit test, and wrapped it like so:

require 'ruby-prof'
def profile_lotsofstuff
  res = RubyProf.profile do
    test_lotsofstuff
  end
  RubyProf::GraphPrinter.new(res).print(STDOUT, 2)
end
alias_method :test_profile_lotsofstuff, :profile_lotsofstuff   if ENV['PROFILE']

This allows me to easily profile by running that unit test from the command line:

clear ; PROFILE=1 ruby test/unit/stuff_test.rb

While I’m still working on tweaking the various output parameters, the result did confirm my suspicions:

  %total   %self     total      self    children             calls   Name
--------------------------------------------------------------------------------
                      8.04      4.97      3.07                 4/4     String#xor!
  96.87%  59.88%      8.04      4.97      3.07                   4     Range#each
                      1.07      1.07      0.00     1048576/1048576     Fixnum#^
                      1.53      1.53      0.00     1572864/1572880     String#[]
                      0.47      0.47      0.00       524288/524288     String#[]=

Extending with C

Especially for tight nested loops like this, you quickly take a bit performance hit just from the loop overhead. In this case, I also couldn’t find an easy way to iterate through two strings simultaneously. After a while, it became quite annoying knowing that I could do the whole thing in a short little C function.

So that’s what I did. I created a small Rails plugin (xor), with appropriate init.rb, and added a lib/xor.c that looks something like:

VALUE string_xor( int argc, VALUE *argv, VALUE self ) 
{
  const char *src1 = STR2CSTR(argv[0]);
  const char *src2 = STR2CSTR(argv[1]);
  const char *dest = STR2CSTR(self);
  size_t len = RSTRING(self)->len;

  for ( ; len--; ++dest, ++src1, ++src2 )
    *dest ^= *src ^ *src2;

  return self;
}

void Init_xor( void )
{
  rb_define_method( rb_cString, "xor!", ((VALUE (*)(ANYARGS)) string_xor), -1 );
}

Using the Rake task for Ruby extensions from my RDBXML project makes it trivial to build with a small Rakefile:

require 'rake/extensiontask'
desc "Build the XOR extension"
Rake::ExtensionTask.new :xor do |t|
  t.dir = 'lib'
end

Added some test cases for the various XOR identities (e.g. x^0 = x, x^x = 0, etc.) and that was it. I kept the pure Ruby version of the function for later (renamed to slow_xor!).

Benchmarking

A few runs through the higher-level tests show a huge improvement in speed. But just how much is that? It’s easy to tell with Ruby’s built-in benchmarking support. As with the profiling, I find it convenient to hack it onto the existing unit tests, as they already prove a good source of stress-tests.

def benchmark_xor
  n = 10
  Benchmark.bm(8) do |bm|
    bm.report( "XOR (c) :" ) {  n.times { test_xor }  }
    String.module_eval { alias_method :xor!, :slow_xor! }
    bm.report( "XOR (rb):" ) {  n.times { test_xor }  }
  end
end
alias_method :test_benchmark_xor, :benchmark_xor  if ENV['BENCHMARK']
              user     system      total        real
XOR (c) :  1.200000   0.116667   1.316667 (  0.817143)
XOR (rb): 81.433333   0.683333  82.116667 ( 50.567826)

This shows a speed-up of a little over 60x. This changes the execution times for just about every operation from “minutes” into “seconds”. Not bad, eh?

Your Milleage May Vary