Jason Watkins’s Weblog

Disk-Based Parallel Computation

Posted in Blog by jasonwatkinspdx on March 31st, 2008

An interesting recent google tech talk put forward the idea that since 50 disks are approximately the bandwidth of a ram chip, you can look at a single rack computing cluster as if it was large smp machine, where disk is the ram, and ram is cache. If you turn your head on its side a bit this makes sense, but I also think it’s a somewhat limiting analogy: after all a 50 node cluster also has 50x the ram bandwidth due to multiple ram buses. I don’t see a reason to ignore this.

But still, there’s a lot of interesting content in the talk.

I think the most interesting idea is how using a larger but slower storage system can still provide a faster overall solution thanks to opening up a new time speed trade off, provided we have enough bandwidth.

But let’s take the premise and run with it: what can we do to make a clustered disk array look like a single large smp machine with a very large ram? The talk shows us that we can easily equal the bandwidth of a single ram bus. But what about the latency? Cooperman resorts to a lot of algorithms specifically tailored to each application to tolerate the latency. Can we do something simpler and more general?

Watching the video I immediately thought of the old Cray/Tera MTA architecture, which used a large number of hardware threads per core to hide memory latency. If a memory access takes 100 cycles, then if we run 100 threads at the same time, switching between each in turn, we maximize throughput. To each thread it’s as if it’s running on a processor with 1/100th the instruction rate and a 1 cycle ram access latency.

Could we use the same idea with a virtual machine on a cluster? Well, the the back of the envelope says that we’d need nearly 300,000 total outstanding io requests to hide the latency gap. That’s a lot. Maybe even more than we can break the problem into. But, with a disk we can access a block of words for nearly the same cost as the first byte, so if we can exploit any locality of access we should be able to bring that down. On top of that, we can use ram as a cache to avoid some hits altogether. Could this be enough to bring the total outstanding io count in reach of the couple hundred disks we might have in our cluster? It’d be interesting to see. If I were going to try, I’d look at using ZFS for it’s sorting of disk accesses.

I think on the whole I’d rather use an API like Map/Reduce, or a scripting language built upon it like Pig or Sawzall. These tools have a more limited focus than Cooperman’s group, but are likely a lot easier to hit the ground running with to get something done. It’ll be interesting to see if Cooperman’s group can come up with language tools that are more general than Map/Reduce, but still offering a simple development model.

I also can’t help but mention that one of my personal pet interests, Array Languages, could make very good use of a high bandwidth disk array.

Force index with active record find

Posted in Blog by jasonwatkinspdx on March 17th, 2008

A small useful tip:

Mysql’s query planner occasionally has trouble picking the right index. I’ve found that normally if you try variations long enough, eventually you find a way to do the right thing. Recently however I ran into one I couldn’t coax.

No big deal you say, after all, that’s what find_by_sql is for no? Only problem is, the query in question was inside a with_scope block, and I really didn’t want to have to find and hardcode sql finders for any variations used by callers. Thankfully, the :from option on find can rescue us: just go ahead and include the force index clause. It’s slightly gross imho, but it works, and might save you some manual sql building.

Should deployment be viral?

Posted in Blog by jasonwatkinspdx on March 5th, 2008

Reading about Storm recently, I had no idea how big it got before it’s decay. Estimates for how many hosts were in Storm at peak vary, but they’re all in the millions.

I was really struck by how sophisticated command and control for botnets has become, moving from simple IRC servers to peer to peer overlays. The authors are constantly deploying updated code: new attack approaches, email spam, blog comment spam, serving up http exploits… I imagine the application protocols themselves have to be written in such a way that different versions can co-exist in the wild without triggering faults in each other. Thanks to the p2p approach, they can do this simply and rapidly, despite the scale. Their administration mechanisms are even robust against both failure and active attempts to sever control.

Think about how different that is from the typical IT experience. Would your deployment methods still work if you had 50 million hosts to deploy to? Could your administration mechanisms withstand the determined hacking efforts of some of the internet’s smartest security practitioners? Certainly a worm is very different from a production application, but I think there’s a lot to be learned by how these crackers are their solving their problems. Why aren’t we using Name-Dropper protocols, co-existing versions and Epidemic code upgrade?