Dataflow: Erlang-Style Thread Safety in Ruby

Larry Diehl, a.k.a. larrytheliquid, has just released Dataflow: a tiny and remarkable gem that helps Ruby programmers write thread-safe programs more easily by duplicating one of the main features of Erlang—and in my opinion the single most important feature that makes Erlang thread-safe. Dataflow makes all variables write-once (so the name “variable” isn’t really accurate any more). This limitation is really a feature. It makes it easier to write multithreaded programs without synchronization bugs because it’s no longer possible for two threads to write different values to the same variable, and thus there’s no need to synchronize writes. When you reference a variable that has not yet been assigned, Dataflow puts your thread to sleep automatically. It is reawakened automatically when the variable is assigned.
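The blocking-read behavior is the heart of the idea. Here is a minimal sketch of a write-once cell in plain Ruby (my own illustration using Mutex and ConditionVariable, not Dataflow's actual implementation):

```ruby
# A minimal write-once cell: readers block until a writer supplies a value.
# Illustrative sketch only, not Dataflow's actual implementation.
class WriteOnceCell
  def initialize
    @mutex = Mutex.new
    @cond  = ConditionVariable.new
    @set   = false
  end

  # Assign a value; raises if a *different* value was already assigned.
  def unify(value)
    @mutex.synchronize do
      if @set
        raise "unification failed" unless @value == value
      else
        @value = value
        @set = true
        @cond.broadcast # wake every reader sleeping in #value
      end
    end
  end

  # Read the value, sleeping until some thread assigns one.
  def value
    @mutex.synchronize do
      @cond.wait(@mutex) until @set
      @value
    end
  end
end

cell = WriteOnceCell.new
reader = Thread.new { cell.value * 2 } # blocks until the cell is unified
cell.unify(21)
reader.value #=> 42
```

Any number of reader threads can sleep in value; the first unify wakes them all, and later unify calls succeed only when they supply an equal value.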

Before we continue, a word of caution: I’ve mentioned in this blog and in the Ruby on Rails Podcast that even though multithreading is really fun to think about and play with, I approach it with reluctance in real-life projects because it makes the code more complex, makes it a lot harder to debug problems, and is hard to manage when there are multiple programmers who all have to work in and understand the threaded code. But there are still some problems for which threading is the right solution.

I Like Stuff That’s Clean and Small

Dataflow is a beautiful bit of programming. It’s small, clean, and tested. It implements write-once variables with automated thread synchronization in just 52 lines of code. (Plus 120 lines of tests.) It supports:

  • instance variables
  • local variables
  • dynamic values loaded into data structures such as arrays

It doesn’t seem to support class variables, but I guess a constant can serve as a write-once class variable.

Here are some code samples, copied from the README.

# Local variables
include Dataflow

local do |x, y, z|
  # notice how the order automatically gets resolved
  Thread.new { unify y, x + 2 }
  Thread.new { unify z, y + 3 }
  Thread.new { unify x, 1 }
  z #=> 6
end

# Instance variables
class AnimalHouse
  include Dataflow
  declare :small_cat, :big_cat

  def fetch_big_cat
    Thread.new { unify big_cat, small_cat.upcase }
    unify small_cat, 'cat'
    big_cat
  end
end

AnimalHouse.new.fetch_big_cat #=> 'CAT'

# Data-driven concurrency
include Dataflow

local do |stream, doubles, triples, squares|
  unify stream, Array.new(5) { local {|v| v } }

  Thread.new { unify doubles, stream.map {|n| n*2 } }
  Thread.new { unify triples, stream.map {|n| n*3 } }
  Thread.new { unify squares, stream.map {|n| n**2 } }
  Thread.new { stream.each {|x| unify x, rand(100) } }

  puts "original: #{stream.inspect}"
  puts "doubles:  #{doubles.inspect}"
  puts "triples:  #{triples.inspect}"
  puts "squares:  #{squares.inspect}"
end

It doesn’t take long to read the Dataflow code (it’s only 52 lines, after all) but it did take me a while stepping through it in the NetBeans debugger to wrap my head around how it works. Also, the name of the variable-assignment method is a bit unintuitive to me. Assignment is done by calling the unify method. Apparently this name comes from the concept of unification, which as I understand it means: give the system a bunch of expressions whose variables depend on each other, and let it work out the dependencies and evaluate them in the correct order to assign values as they are needed. Anyway, using a method called unify for assignment takes a little getting used to.

Note: Larry’s README says Dataflow was inspired by the Oz programming language, not the Erlang programming language. But I’m more familiar with Erlang so that’s what I can compare it to. The primary difference between Ruby-with-Dataflow and Erlang is that in Dataflow you declare a variable and then assign it a value, whereas in Erlang you have to assign at the moment you declare it. That’s how Erlang makes variables write-once: if you can only assign a value when you declare a variable, obviously it will only be assigned once. Dataflow lets you assign to the same variable multiple times but raises an error if you assign different values, so it’s equivalent to write-once. (It uses the != operator to decide whether the new value differs from the old one.)
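That equal-values rule is easy to demonstrate with a toy version of unify (a hypothetical single-threaded sketch; Dataflow’s real unify also handles the thread-wakeup side):

```ruby
# Toy unify over a hash of bindings (illustration only, not Dataflow's code).
store = {}
unify = lambda do |name, value|
  if store.key?(name) && store[name] != value
    raise "unification failed: #{name} is already #{store[name].inspect}"
  end
  store[name] = value
end

unify.call(:x, 1)
unify.call(:x, 1)   # unifying with an equal value is a no-op
begin
  unify.call(:x, 2) # a different value raises
rescue RuntimeError => e
  puts e.message
end
```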

Interop with “Normal” Ruby

The README also says, “The nice thing is that many existing libraries/classes/methods can still be used, just avoid side-effects.”

It’s true that you can write a program that uses Dataflow for some variables but also interops with non-Dataflow code as long as that code is thread-safe. I’m not quite sure what he meant by “just avoid side-effects.”

But How Do You Assign New Values to Variables?

If a variable can only be written once, what do you do when you need to change it? Obviously programs need to deal with this. For example, what if you need to loop over an array and keep track of the index as you go? Erlang handles it by heavy use of the stack and threads, so whenever you need a new value you call a function (which spawns a thread) and the function declares a new variable, assigning the new value to it. So there’s a lot of copying of values.

In Ruby with Dataflow I imagine you would do something similar: either call a function or spawn a thread for each iteration, passing in the current value, and have the function or thread declare a new local variable which is value+1. This style of programming takes some time before it becomes natural. It’s not yet natural for me.
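For example, here is a write-once-friendly loop in plain Ruby (my own sketch, nothing Dataflow-specific): instead of mutating an index and an accumulator, each recursive call binds fresh ones.

```ruby
# Sum an array without reassigning any variable: each call gets
# fresh bindings for index and total instead of mutating them.
def sum(array, index = 0, total = 0)
  return total if index == array.length
  sum(array, index + 1, total + array[index])
end

sum([1, 2, 3, 4]) #=> 10
```

Every iteration becomes a new stack frame, which is exactly the performance question raised below.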

There could also be performance implications. Erlang’s interpreter optimizes tail recursion and converts it to an iteration (really a GOTO) under the hood so the stack doesn’t overflow. I don’t know if any Ruby interpreters do that. As of a few years ago they didn’t, according to my Google search. Johannes Friestad wrote in 2005, “Recursion, tail or no tail, works just as well as any other method call in Ruby. Plenty of languages thrive without optimizing for tail recursion, Java is one of them. The combination of a small stack and lack of tail recursion optimization does mean that in Ruby, recursion can hardly replace every other looping construct the way it can in Lisp. You’ll be the judge of whether that is important.”

Update: Larry writes in the comments that “this library makes JRuby shine over MRI due to its green threads + native thread pool implementation.” I’ve only used MRI and I didn’t know about that aspect of JRuby but it’s pretty nice. It sounds like if you’re going to use Dataflow you might want to use it with JRuby rather than MRI.

Possible Concerns

Dataflow is really cool but I do have a few potential concerns about it:

  1. Even though Dataflow makes it easier to write thread-safe code, it doesn’t fix the fact that it’s hard to debug multithreaded code. Stepping through multithreaded code in a debugger is complicated, especially when the code switches thread context on the fly.
  2. Speaking of debugging, if the debugger tries to show you the value of a Dataflow variable that hasn’t yet been assigned, the debugger thread itself will be put to sleep. In NetBeans this means the “locals” pane stops working (but you can still debug) and if you hover the mouse over an unassigned variable, you don’t see anything in the tooltip. In rdebug it’s worse: if you eval a variable that doesn’t yet have a value, rdebug hangs because its main thread gets put to sleep.
  3. You can’t assign nil to a Dataflow variable because nil is used to indicate that it hasn’t yet been assigned. I would like to be able to assign a value of nil and have that be different from “unassigned.” This would be a pretty easy fix to make to Dataflow without bloating memory: all unassigned variables could reference the same sentinel constant.
    Update: this concern has been fixed. Dataflow now differentiates between nil and unassigned.
  4. Memory overhead: Dataflow is as efficient as possible with memory usage but it does incur some overhead on each variable. Compared to unthreaded programming, it is a lot. But compared to manual thread synchronization it’s probably about the same amount of memory you would have used for synchronization data structures anyway. It depends on how you do your manual synchronization. Each variable has, in addition to its value:
    1. a Mutex
    2. an Array (initially empty) of references to Threads that are waiting for it to be initialized
    3. a Monitor condition to wake up the Threads that are waiting for it to be initialized
    4. a Boolean to track whether it has a value yet (but cleverly, this boolean doesn’t get assigned until the variable is assigned, which saves some memory)
  5. More than the overhead per variable, I wonder about the memory overhead of constantly copying values rather than reassigning them. If the stack gets too deep you could run out of memory from all the copying. (See my description of looping above.) You also make the garbage collector work pretty hard. If you loop by spawning threads instead of using recursion, you incur a lot of overhead since threads are expensive compared to function calls. This is why Erlang’s interpreter has its own threading system instead of using the one in the operating system–threads have to be as cheap as function calls. In Ruby they are not.
  6. Related to memory overhead, I wonder about the performance overhead. In addition to deep stacks and lots of threads, every time you call a method on a variable it gets routed through method_missing and Mutex.synchronize even if you have already called that method on that variable. (It does this so its method_missing override can put your thread to sleep until the variable has a value.) This could be expensive but it’s impossible to know for sure without profiling it. If it turns out to be a problem, method_missing could rewrite itself the first time after the variable gets assigned a value so all subsequent calls don’t have to be synchronized.
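That rewrite-after-assignment idea could look something like this sketch (my own guess at an approach; the class and method names are hypothetical, not Dataflow’s code):

```ruby
# Illustrative sketch of the proposed optimization: a proxy routes method
# calls through method_missing plus a mutex until the value exists, then
# defines real singleton methods that skip synchronization afterwards.
class LazyProxy
  def initialize
    @mutex = Mutex.new
    @cond  = ConditionVariable.new
    @set   = false
  end

  def assign(value)
    @mutex.synchronize do
      @value = value
      @set = true
      @cond.broadcast
    end
  end

  def method_missing(name, *args, &block)
    @mutex.synchronize { @cond.wait(@mutex) until @set }
    # Fast path for future calls: forward this method without locking.
    define_singleton_method(name) { |*a, &b| @value.send(name, *a, &b) }
    @value.send(name, *args, &block)
  end
end

proxy = LazyProxy.new
proxy.assign('cat')
proxy.upcase #=> "CAT" (first call goes through method_missing)
proxy.upcase #=> "CAT" (now served by the singleton method, no mutex)
```

After the first forwarded call, subsequent calls to the same method skip both method_missing and the Mutex entirely.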

That reads like a pretty big list of concerns but without actually using Dataflow I can’t tell how many of them will actually cause problems. I still think it’s cool. 🙂

Try it if You Need Threading

I mentioned at the beginning that I’m cautious about using threads but there are some problems for which they are the right solution. Next time I’m confronted with such a problem on a Ruby project I will drop in the Dataflow gem and give it a try. It looks like a pretty good way to do threading in Ruby.

14 thoughts on “Dataflow: Erlang-Style Thread Safety in Ruby”

  1. Thanks for the article! I updated the library to be able to handle using nil. I also changed the internals to use monitors and condition variables to eliminate a tiny race condition that I was putting off.

    One thing to note is that this library makes JRuby shine over MRI due to its green threads + native thread pool implementation.

    Another big advantage that I’d like to point out is that the specs in the library are using threads! Normally one mocks out threads for testing, but because this model is deterministic you can keep them in.

  2. Thanks for the quick update to support nil values, Larry. And it’s interesting to hear about JRuby’s implementation of green threads + native pool. I didn’t know about that but it certainly seems ideal for a library like Dataflow, compared to all native threads.

    I’ve updated the blog post to reflect your changes and your note about JRuby.

  3. The most exciting property of dataflow variables is that they are deterministic, even though they are concurrent. They will only produce one result, regardless of the order of execution. (It is possible for them not to complete at all, or to raise an exception, but this also happens deterministically). This is the magic of unify. (I think this is true even when there’s nondeterministic input, but I have to check).

    If you want to model actual asynchronous behavior where the result depends on the order of execution, like with a client-server model, then you can use Ports in the style of the Oz language – or Actors in the style of erlang, which in this case are built on top of Ports!

    But introducing state changes makes dataflow Ruby behave more like ordinary Ruby again. It ‘undoes’ the determinism of dataflow variables. You would typically implement an identical Actor using a Ruby Queue.

  4. Correction, JRuby does not implement green threads. It just implements a pool of native threads to overcome the high overhead of thread creation specially for short lived jobs.

  5. Just to add to oldmoe’s comment:

    Yes, JRuby’s threads are always native, but you can specify that the native thread constructed gets pooled behind the scenes. This helps reduce the cost of spinning up a lot of threads. We have other plans to try to reduce the cost of spinning many small threads as well.

    Ultimately, however, with or without the pool, any non-trivial threaded application should see a lot better performance on JRuby, since the threads will *actually* run at the same time. Just don’t spin up a million tiny threads that add 1+1 and you’ll probably do fine.

  6. Thanks for the great article! I believe however that you have to make some small corrections on the Erlang comparisons/explanations.

    […] if you can only assign a value when you declare a variable, obviously it will only be assigned once. […]

    In Erlang, there is no variable assignment; what actually happens is a pattern match, and as a result of the match the variable gets “bound” to the result, which is fundamentally different from assignment. Secondly, variables can be declared and not bound; these are termed “unbound variables.”

    […] Erlang handles it by heavy use of the stack and threads, so whenever you need a new value you call a function (which spawns a thread) and the function declares a new variable, assigning the new value to it. So there’s a lot of copying of values. […]

    Erlang is stackless (which makes their TCO easier to implement). Secondly, Erlang doesn’t create a thread when you spawn/1; it’s a lightweight process, fundamentally different from a thread. I know what you are trying to say, but care must be taken since you are trying to validate a point based on false statements.

    Thanks for the info on this gem! I will definitely check it out as I feel that the only way to go forward from here on in is to make the switch to immutable state.

    Please note that I am not an Erlang expert, I just want to make sure that your article is not dismissed because of the errors.


  7. Thanks for the corrections, Ilan. You’re right about Erlang using lightweight processes rather than threads. I was aware of that and probably should have mentioned it in the article. But I didn’t know about your other points–the ones about bound and unbound variables and about Erlang being stackless.

  8. “… using a method called unify for assignment takes a little getting used to …”

    Unification is more general than assignment. Assignment requires the left hand side to be a single variable. Unification, at least in most languages that support it, allows both sides to be arbitrary patterns, where any unbound dataflow variable in one pattern can be bound to a corresponding value in the other. This is explained well in the book ‘Concepts, Techniques and Models of Computer Programming’. (I don’t know whether this particular implementation supports patterns on both sides.)

    “Speaking of debugging, if the debugger tries to show you the value of a Dataflow variable that hasn’t yet been assigned, the debugger thread itself will be put to sleep.”

    That’s unfortunate. Ideally the debugger would just show that the variable is unbound (as it does in Mozart/Oz, and most other systems that support dataflow variables).
