Dataflow: Erlang-Style Thread Safety in Ruby

Larry Diehl, a.k.a. larrytheliquid, has just released Dataflow: a tiny and remarkable gem that helps Ruby programmers write thread-safe programs more easily by duplicating one of the main features of Erlang—and in my opinion the single most important feature that makes Erlang thread-safe. Dataflow makes all variables write-once (so the name “variable” isn’t really accurate any more). This limitation is really a feature. It makes it easier to write multithreaded programs without synchronization bugs because it’s no longer possible for two threads to write different values to the same variable, and thus there’s no need to synchronize writes. When you reference a variable that has not yet been assigned, Dataflow puts your thread to sleep automatically. It is reawakened automatically when the variable is assigned.

Before we continue, a word of caution: I’ve mentioned in this blog and in the Ruby on Rails Podcast that even though multithreading is really fun to think about and play with, I approach it with reluctance in real-life projects because it makes the code more complex, makes it a lot harder to debug problems, and is hard to manage when there are multiple programmers who all have to work in and understand the threaded code. But there are still some problems for which threading is the right solution.

I Like Stuff That’s Clean and Small

Dataflow is a beautiful bit of programming. It’s small, clean, and tested. It implements write-once variables with automated thread synchronization in just 52 lines of code. (Plus 120 lines of tests.) It supports:

  • instance variables
  • local variables
  • dynamic values loaded into data structures such as arrays

It doesn’t seem to support class variables, but I guess a constant can serve as a write-once class variable.

Here are some code samples, copied from the README.

# Local variables
include Dataflow

local do |x, y, z|
  # notice how the order automatically gets resolved
  Thread.new { unify y, x + 2 }
  Thread.new { unify z, y + 3 }
  Thread.new { unify x, 1 }
  z #=> 6
end

# Instance variables
class AnimalHouse
  include Dataflow
  declare :small_cat, :big_cat

  def fetch_big_cat
    Thread.new { unify big_cat, small_cat.upcase }
    unify small_cat, 'cat'
    big_cat
  end
end

AnimalHouse.new.fetch_big_cat #=> 'CAT'

# Data-driven concurrency
include Dataflow

local do |stream, doubles, triples, squares|
  unify stream, Array.new(5) { local {|v| v } }

  Thread.new { unify doubles, stream.map {|n| n*2 } }
  Thread.new { unify triples, stream.map {|n| n*3 } }
  Thread.new { unify squares, stream.map {|n| n**2 } }

  Thread.new { stream.each {|x| unify x, rand(100) } }

  puts "original: #{stream.inspect}"
  puts "doubles:  #{doubles.inspect}"
  puts "triples:  #{triples.inspect}"
  puts "squares:  #{squares.inspect}"
end

It doesn’t take long to read the Dataflow code (it’s only 52 lines, after all) but it did take me a while stepping through it in the NetBeans debugger to wrap my head around how it works. Also, the name of the variable-assignment method is a little unintuitive to me. Assignment is done by calling the unify method. Apparently this name comes from the concept of unification, which I think means: provide a bunch of algorithms whose variables have dependencies on each other and let the system work out the dependencies and execute the algorithms in the correct order to assign values as they are needed. Anyway, using a method called unify for assignment takes a little getting used to.

Note: Larry’s README says Dataflow was inspired by the Oz programming language, not the Erlang programming language. But I’m more familiar with Erlang so that’s what I can compare it to. The primary difference between Ruby-with-Dataflow and Erlang is that in Dataflow you declare a variable and then assign it a value, whereas in Erlang you have to assign at the moment you declare it. That’s how Erlang makes variables write-once: if you can only assign a value when you declare a variable, obviously it will only be assigned once. Dataflow lets you assign to the same variable multiple times but raises an error if you assign different values, so it’s equivalent to write-once. (It uses the != operator to decide whether the values are equal.)
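
To make the write-once idea concrete, here is a minimal sketch of a cell that behaves the way I described: reads block until a value arrives, re-assigning an equal value is fine, and assigning a different value raises. This is NOT Dataflow’s source. The class name WriteOnceCell and its value method are my own inventions for illustration; only the unify name and the != comparison come from the discussion above.

```ruby
require 'thread'

# Hypothetical illustration of a write-once variable.
# Not Dataflow's actual implementation.
class WriteOnceCell
  def initialize
    @mutex = Mutex.new
    @became_bound = ConditionVariable.new
    @bound = false
  end

  # Bind the cell. Binding again with an equal value is a no-op;
  # binding with a different value raises.
  def unify(new_value)
    @mutex.synchronize do
      if !@bound
        @value = new_value
        @bound = true
        @became_bound.broadcast # wake every thread sleeping in #value
      elsif @value != new_value
        raise ArgumentError, "already bound to #{@value.inspect}"
      end
    end
    self
  end

  # Put the calling thread to sleep until the cell is bound,
  # then return the value.
  def value
    @mutex.synchronize do
      @became_bound.wait(@mutex) until @bound
      @value
    end
  end
end

x = WriteOnceCell.new
y = WriteOnceCell.new

reader = Thread.new { y.unify(x.value + 2) } # sleeps until x is bound
x.unify(1)
reader.join
puts y.value # prints 3

x.unify(1) # same value again: allowed
begin
  x.unify(5) # different value: rejected
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

The explicit value call is the main difference from the README samples, where you use the variable directly. That transparency is what the method_missing trick discussed later buys you.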

Interop with “Normal” Ruby

The README also says, “The nice thing is that many existing libraries/classes/methods can still be used, just avoid side-effects.”

It’s true that you can write a program that uses Dataflow for some variables but also interops with non-Dataflow code as long as that code is thread-safe. I’m not quite sure what he meant by “just avoid side-effects.”

But How Do You Assign New Values to Variables?

If a variable can only be written once, what do you do when you need to change it? Obviously programs need to deal with this. For example, what if you need to loop over an array and keep track of the index as you go? Erlang handles it by heavy use of the stack and threads, so whenever you need a new value you call a function (which spawns a thread) and the function declares a new variable, assigning the new value to it. So there’s a lot of copying of values.

In Ruby with Dataflow I imagine you would do something similar: either call a function or spawn a thread for each iteration, passing in the current value, and have the function or thread declare a new local variable which is value+1. This style of programming takes some time before it becomes natural. It’s not yet natural for me.
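
As a rough sketch of the function-call flavor of this, here is plain Ruby (no Dataflow involved) that walks an array and tracks the index without ever reassigning a variable. The method name each_with_position is made up for this example. Each call gets a fresh index binding, which is index + 1 from the caller’s point of view.

```ruby
# Walk an array without reassigning anything: every recursive call
# receives its own new `index` binding.
# `each_with_position` is a hypothetical name used for illustration.
def each_with_position(items, index = 0, &block)
  return if index >= items.length

  block.call(items[index], index)
  each_with_position(items, index + 1, &block)
end

each_with_position(%w[a b c]) { |item, i| puts "#{i}: #{item}" }
# prints:
# 0: a
# 1: b
# 2: c
```

This is fine for short arrays, but every element costs a stack frame, so a long enough array would exhaust the stack.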

There could also be performance implications. Erlang’s interpreter optimizes tail recursion and converts it to an iteration (really a GOTO) under the hood so the stack doesn’t blow. I don’t know if any Ruby interpreters do that. As of a few years ago they didn’t, according to my Google search. Johannes Friestad wrote in 2005, “Recursion, tail or no tail, works just as well as any other method call in Ruby. Plenty of [languages] thrive without optimizing for tail recursion, Java is one of them. The combination of a small stack and lack of tail recursion optimization does mean that in Ruby, recursion can hardly replace every other looping construct the way it can in Lisp. You’ll be the judge of whether that is important.”

Update: Larry writes in the comments that “this library makes JRuby shine over MRI due to its green threads + native thread pool implementation.” I’ve only used MRI and I didn’t know about that aspect of JRuby but it’s pretty nice. It sounds like if you’re going to use Dataflow you might want to use it with JRuby rather than MRI.

Possible Concerns

Dataflow is really cool but I do have a few potential concerns about it:

  1. Even though Dataflow makes it easier to write thread-safe code, it doesn’t fix the fact that it’s hard to debug multithreaded code. Stepping through multithreaded code in a debugger is complicated, especially when the code switches thread context on the fly.
  2. Speaking of debugging, if the debugger tries to show you the value of a Dataflow variable that hasn’t yet been assigned, the debugger thread itself will be put to sleep. In NetBeans this means the “locals” pane stops working (but you can still debug) and if you hover the mouse over an unassigned variable, you don’t see anything in the tooltip. In rdebug it’s worse–if you eval a variable that doesn’t yet have a value, rdebug hangs because its main thread gets put to sleep.
  3. You can’t assign nil to a Dataflow variable because nil is used to indicate that it hasn’t yet been assigned. I would like to be able to assign a value of nil and have that be different from “unassigned.” This would be a pretty easy fix to make to Dataflow without bloating memory–all unassigned variables could reference the same constant:
    UNASSIGNED = Object.new
    I removed this concern because it’s been fixed. Dataflow now differentiates between nil and unassigned.
  4. Memory overhead: Dataflow is as efficient as possible with memory usage but it does incur some overhead on each variable. Compared to unthreaded programming, it is a lot. But compared to manual thread synchronization it’s probably about the same amount of memory you would have used for synchronization data structures anyway. It depends on how you do your manual synchronization. Each variable has, in addition to its value:
    1. a Mutex
    2. an Array (initially empty) of references to Threads that are waiting for it to be initialized
    3. a Monitor condition to wake up the Threads that are waiting for it to be initialized
    4. a Boolean to track whether it has a value yet (but cleverly, this boolean doesn’t get assigned until the variable is assigned, which saves some memory)
  5. More than the overhead per variable, I wonder about the memory overhead of constantly copying values rather than reassigning them. If the stack gets too deep you could run out of memory from all the copying. (See my description of looping above.) You also make the garbage collector work pretty hard. If you loop by spawning threads instead of using recursion, you incur a lot of overhead since threads are expensive compared to function calls. This is why Erlang’s interpreter has its own threading system instead of using the one in the operating system–threads have to be as cheap as function calls. In Ruby they are not.
  6. Related to memory overhead, I wonder about the performance overhead. In addition to deep stacks and lots of threads, every time you call a method on a variable it gets routed through method_missing and Mutex#synchronize even if you have already called that method on that variable. (It does this so its method_missing override can put your thread to sleep until the variable has a value.) This could be expensive but it’s impossible to know for sure without profiling it. If it turns out to be a problem, method_missing could rewrite itself the first time after the variable gets assigned a value so all subsequent calls don’t have to be synchronized.
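
To illustrate the last concern, here is a sketch of a proxy that forwards every method call through method_missing, sleeps while the value is missing, and skips the mutex once the value is there. It is my own approximation, not Dataflow’s code: the class name BlockingProxy and the __bind__ method are hypothetical, and the unsynchronized check of @bound is an assumption that is only reasonable because the flag flips from false to true exactly once. On a Ruby implementation without a global lock I would want to verify the memory-visibility guarantees before relying on it.

```ruby
require 'thread'

# Hypothetical sketch, not Dataflow's implementation.
# Inherits from BasicObject so nearly every call lands in method_missing.
class BlockingProxy < BasicObject
  def initialize
    @mutex = ::Mutex.new
    @became_bound = ::ConditionVariable.new
    @bound = false
  end

  # Assign the value exactly once and wake any sleeping callers.
  def __bind__(value)
    @mutex.synchronize do
      ::Kernel.raise ::ArgumentError, "already bound" if @bound
      @value = value
      @bound = true
      @became_bound.broadcast
    end
    self
  end

  def method_missing(name, *args, &block)
    # Fast path (assumption): once @bound is true it never changes,
    # so later calls skip the mutex entirely.
    unless @bound
      @mutex.synchronize { @became_bound.wait(@mutex) until @bound }
    end
    @value.__send__(name, *args, &block)
  end
end

small_cat = BlockingProxy.new
shouter = ::Thread.new { small_cat.upcase } # sleeps inside method_missing
small_cat.__bind__("cat")
puts shouter.value # prints CAT
```

Whether the fast path is worth having is exactly the kind of question that needs profiling. The sketch only shows that it is a small change.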

That reads like a pretty big list of concerns but without actually using Dataflow I can’t tell how many of them will actually cause problems. I still think it’s cool. 🙂

Try it if You Need Threading

I mentioned at the beginning that I’m cautious about using threads but there are some problems for which they are the right solution. Next time I’m confronted with such a problem on a Ruby project I will drop in the Dataflow gem and give it a try. It looks like a pretty good way to do threading in Ruby.

Add Optional SEO-Friendliness to link_to_remote

link_to_remote_with_seo adds optional SEO-friendly goodness to the Rails link_to_remote function.  I wrote it for cases where I would have used link_to_remote in my Rails app but I wanted GoogleBot and other search engines to be able to follow the links.  In addition to setting onclick like the normal link_to_remote, it also sets html_options[:href] to the SAME URL that you pass in to options[:url]. (It only does this if you pass :seo => true and you do not explicitly set the href.)
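
The href rule in that last parenthesis can be sketched in plain Ruby. This is not the plugin’s source: seo_html_options and the url_resolver lambda are hypothetical names, and the lambda stands in for whatever Rails would do to turn options[:url] into a URL string.

```ruby
# Hypothetical sketch of the href-defaulting rule, not the plugin's code.
# `url_resolver` is a stand-in for Rails URL generation (an assumption).
def seo_html_options(options, html_options = {}, url_resolver = ->(url) { url.to_s })
  return html_options unless options[:seo]          # opt-in only
  return html_options if html_options.key?(:href)   # explicit href wins

  html_options.merge(:href => url_resolver.call(options[:url]))
end

opts = { :seo => true, :url => "/home/next_page" }

puts seo_html_options(opts)[:href]                          # prints /home/next_page
puts seo_html_options(opts, { :href => "/confirm" })[:href] # prints /confirm
puts seo_html_options({ :url => "/home/next_page" }).key?(:href) # prints false
```

The explicit-href escape hatch matters for the warning further down: it is how you keep a destructive URL out of the href.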

See the big honking warning at the bottom for an explanation of why this plugin doesn’t just override the behavior of link_to_remote.

I Like Stuff that’s SEO-Friendly

The following example shows a “Next” link in paginated output.  Clicking the link in a browser results in an AJAX call (using the POST method) that retrieves just the “page” partial and inserts it into the “results” div on the page with a highlight visual effect.  When a search engine sees the link, however, it will send a GET request to the same URL, and the entire page (not just the partial) will be sent in the response.

Putting this in the view (home/index.html.erb):

<div id="results">
  <%= render :partial => "page" -%>
</div>
<%= link_to_seo_remote "Next",
  { :update => "#results",
    :url => { :action => "next_page" },
    :complete => visual_effect(:highlight, "#results") } %>

Produces (pay attention to the href attribute):

<div id="results">
  <!-- first page of results shown here -->
</div>
<a href="/home/next_page"
  onclick="new Ajax.Updater('#results', '/home/next_page',
  {asynchronous:true, evalScripts:true,
  onComplete:function(request){new Effect.Highlight(&quot;#results&quot;,{});}}); return false;">
  Next
</a>

In the controller (home_controller.rb), render just the partial if called in an XHR (AJAX) request:

def next_page
  if request.xhr?
    render :partial => "page"
  else
    # Render the entire page, including the "results" section.
    render :action => "index"
  end
end

WARNING ABOUT INCORRECT USE OF THIS FUNCTION

Sorry but I have to yell for emphasis here.

When Google crawls your site it will follow all links on a page in advance, even before the user clicks on them.  Adding :confirm => "Are you sure?" WILL NOT HELP because it generates JavaScript that Google doesn’t execute.  So when you use link_to_seo_remote, DO NOT ALLOW destructive links to be placed in the href attribute.  Instead, override html_options[:href] to link to an intermediate page with “Are you sure?” and a BUTTON (not a link).  The crawler will not click the button, so the data will not be deleted.

See Using Rails AJAX Helpers to Create Safe State-Changing Links and search the page for “request.post?” for an explanation and some sample code.

Does it Have Tests?

Why, yes. I’d like to thank the Rails Community for not tolerating code with no tests. It was soooo tempting just to release this without writing automated tests but the peer pressure got to me.

And I’d also like to thank Cake for awesome music.

To get the code

ruby script/plugin install http://github.com/BMorearty/link_to_remote_with_seo.git