EC2 Loses its Last Limit
Up until now, frequent dumps to s3 and a slim chance of losing ~10 minutes of data were the price you had to pay for hosting databases on ec2. But now that’s gone. Once again Amazon focuses on a simple tool that can be used in a variety of ways. The only improvement I could want is for it to be a replicated block device a la MSR Boxwood or DRBD.
I think David Young’s comments comparing single provider platforms to open clusters are spot on. For some users AppEngine is very attractive. But there will always be many companies that can’t abide building their business as a vassal on the plot the lord of the manor gives them. Amazon is quickly becoming the key provider for this market, but I think we’ll see plenty of competition in the future.
Disk-Based Parallel Computation
An interesting recent google tech talk put forward the idea that since 50 disks are approximately the bandwidth of a ram chip, you can look at a single rack computing cluster as if it was large smp machine, where disk is the ram, and ram is cache. If you turn your head on its side a bit this makes sense, but I also think it’s a somewhat limiting analogy: after all a 50 node cluster also has 50x the ram bandwidth due to multiple ram buses. I don’t see a reason to ignore this.
But still, there’s a lot of interesting content in the talk.
I think the most interesting idea is how using a larger but slower storage system can still provide a faster overall solution thanks to opening up a new time speed trade off, provided we have enough bandwidth.
But let’s take the premise and run with it: what can we do to make a clustered disk array look like a single large smp machine with a very large ram? The talk shows us that we can easily equal the bandwidth of a single ram bus. But what about the latency? Cooperman resorts to a lot of algorithms specifically tailored to each application to tolerate the latency. Can we do something simpler and more general?
Watching the video I immediately thought of the old Cray/Tera MTA architecture, which used a large number of hardware threads per core to hide memory latency. If a memory access takes 100 cycles, then if we run 100 threads at the same time, switching between each in turn, we maximize throughput. To each thread it’s as if it’s running on a processor with 1/100th the instruction rate and a 1 cycle ram access latency.
Could we use the same idea with a virtual machine on a cluster? Well, the the back of the envelope says that we’d need nearly 300,000 total outstanding io requests to hide the latency gap. That’s a lot. Maybe even more than we can break the problem into. But, with a disk we can access a block of words for nearly the same cost as the first byte, so if we can exploit any locality of access we should be able to bring that down. On top of that, we can use ram as a cache to avoid some hits altogether. Could this be enough to bring the total outstanding io count in reach of the couple hundred disks we might have in our cluster? It’d be interesting to see. If I were going to try, I’d look at using ZFS for it’s sorting of disk accesses.
I think on the whole I’d rather use an API like Map/Reduce, or a scripting language built upon it like Pig or Sawzall. These tools have a more limited focus than Cooperman’s group, but are likely a lot easier to hit the ground running with to get something done. It’ll be interesting to see if Cooperman’s group can come up with language tools that are more general than Map/Reduce, but still offering a simple development model.
I also can’t help but mention that one of my personal pet interests, Array Languages, could make very good use of a high bandwidth disk array.


