Jason Watkins’s Weblog

Disk-Based Parallel Computation

Posted in Blog by jasonwatkinspdx on March 31st, 2008

An interesting recent google tech talk put forward the idea that since 50 disks are approximately the bandwidth of a ram chip, you can look at a single rack computing cluster as if it was large smp machine, where disk is the ram, and ram is cache. If you turn your head on its side a bit this makes sense, but I also think it’s a somewhat limiting analogy: after all a 50 node cluster also has 50x the ram bandwidth due to multiple ram buses. I don’t see a reason to ignore this.

But still, there’s a lot of interesting content in the talk.

I think the most interesting idea is how using a larger but slower storage system can still provide a faster overall solution thanks to opening up a new time speed trade off, provided we have enough bandwidth.

But let’s take the premise and run with it: what can we do to make a clustered disk array look like a single large smp machine with a very large ram? The talk shows us that we can easily equal the bandwidth of a single ram bus. But what about the latency? Cooperman resorts to a lot of algorithms specifically tailored to each application to tolerate the latency. Can we do something simpler and more general?

Watching the video I immediately thought of the old Cray/Tera MTA architecture, which used a large number of hardware threads per core to hide memory latency. If a memory access takes 100 cycles, then if we run 100 threads at the same time, switching between each in turn, we maximize throughput. To each thread it’s as if it’s running on a processor with 1/100th the instruction rate and a 1 cycle ram access latency.

Could we use the same idea with a virtual machine on a cluster? Well, the the back of the envelope says that we’d need nearly 300,000 total outstanding io requests to hide the latency gap. That’s a lot. Maybe even more than we can break the problem into. But, with a disk we can access a block of words for nearly the same cost as the first byte, so if we can exploit any locality of access we should be able to bring that down. On top of that, we can use ram as a cache to avoid some hits altogether. Could this be enough to bring the total outstanding io count in reach of the couple hundred disks we might have in our cluster? It’d be interesting to see. If I were going to try, I’d look at using ZFS for it’s sorting of disk accesses.

I think on the whole I’d rather use an API like Map/Reduce, or a scripting language built upon it like Pig or Sawzall. These tools have a more limited focus than Cooperman’s group, but are likely a lot easier to hit the ground running with to get something done. It’ll be interesting to see if Cooperman’s group can come up with language tools that are more general than Map/Reduce, but still offering a simple development model.

I also can’t help but mention that one of my personal pet interests, Array Languages, could make very good use of a high bandwidth disk array.

Leave a Reply