Thursday, February 28, 2008

Appliances and Hybrids

In my last post, which was about the fab book Network Algorithmics, I mentioned intelligent in-network appliances. I also mentioned that "the world doesn't need to be all hardware or all software".

I believe this to be an essential point - blending rapidly improving commodity hardware with just a touch of custom ASIC/FPGA components, glued together by a low-level, architecture-aware systems programming approach, makes unbelievably good products that can be made and produced relatively inexpensively.

These appliances and hybrids are rapidly showing up everywhere - Tony Bourke of the Load Balancing Digest makes a similar point (why choose between ASIC and software?) in a post today. Read it.

Wednesday, February 27, 2008

Network Algorithmics

In late 2004 George Varghese published an amazing book on the design and implementation of networking projects: Network Algorithmics.

You won't find the boilerplate "this is IP, this is TCP, this is layer blah" crapola here. What makes this book special is that it describes the way the industry really builds high-end switches, routers, high-end server software, and an ever-growing array of intelligent in-network appliances. The way that is really done is very different from what is presented in a classic networking textbook.

The material is not a catalog of algorithms; rather, it is a way of looking at things and of breaking down abstraction barriers through techniques like zero copy, tag hinting, and lazy evaluation. It makes the point that layers are how you describe protocols - not how you want to implement them. The world doesn't have to be all hardware or all software, and this book helps train your mind to write systems that harmonize both by taking the time to really understand the architecture.

To me, this book is a fundamental bookshelf item. You don't hear it often mentioned with the likes of Stevens, Tanenbaum, and Cormen - but amongst folks in the know, it is always there. More folks ought to know.

Friday, February 15, 2008

SLAs as medians, percentiles, or averages - as told by Amazon's Dynamo

I am going to recommend another published paper because of how it talks about characterizing its own performance. That's not the point of the paper, but I found it really interesting anyhow.

The paper is Dynamo: Amazon’s Highly Available Key-value Store, by a number of good folks over at Amazon.com (including the CTO Werner Vogels). It was published at the peer-reviewed SOSP '07.

The bulk of the paper is about Dynamo, which apparently is a home grown key-value storage system that has a lot of inherent replication, scalability, and reliability. A good read.

Instead of the big picture, I wanted to focus on a detail. The paper rejects the notion of defining SLAs with median expectations. Instead, Dynamo uses the 99.9th percentile. (It also rejects using means, but that is pretty nonsensical anyhow, isn't it?) The central idea is that the SLA should define an acceptable experience for all users - not just for half of them (the median).

This matters in real life in a very real way. There is normally one vanilla version of an operation and then an à la carte menu of choices users might layer on top of it. Lots of folks stick to the vanilla version, but if 10, 20, 30, or even 40 percent of users invoke features that require extra processing, a median-based SLA leaves them out entirely. The canonical Amazon case is a user with a very long purchasing history - an important case, but not the average one. The 99.9th percentile installs a threshold for a reasonable definition of "everybody" while still leaving room for the occasional pathological case. The authors point out that for more money you can have more nines, but the law of diminishing returns certainly applies.
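To make the contrast concrete, here is a minimal Python sketch (the latency figures and the 5% slow-path fraction are invented for illustration, not taken from the paper) showing how the median can completely hide the users on the expensive path while the 99.9th percentile cannot:

import math
import random

# Illustrative numbers only: a vanilla request takes ~10ms, but 5% of
# requests hit an expensive path (say, a very long purchase history)
# and take ~200ms.
random.seed(42)
latencies = []
for _ in range(100000):
    if random.random() < 0.05:
        latencies.append(random.gauss(200, 20))   # slow path: extra processing
    else:
        latencies.append(random.gauss(10, 2))     # vanilla path

def percentile(samples, p):
    """Nearest-rank p-th percentile (p in 0-100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

print("median (50th): %6.1f ms" % percentile(latencies, 50))   # ~10ms - the slow path is invisible
print("99.9th       : %6.1f ms" % percentile(latencies, 99.9)) # ~240ms - the slow path sets the bar

A median-based SLA would happily report ~10ms here even though one request in twenty is an order of magnitude slower.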

You have to wonder if after today's S3 outage they wish they had bought another nine or two ;) (I have no reason to think that had anything to do with Dynamo - I'm just poking fun - what Amazon has built for S3/EC2 is very impressive)

I was happily nodding along as I read the paper when this came up:
Dynamo is built for latency sensitive applications that require at least 99.9% of read and write operations to be performed within a few hundred milliseconds
My initial reaction was not charitable: Whoa - that's not really setting the bar very high, is it guys? Hundreds of millis to read and write a key/value pair? You say yourself in the introductory pages that these services are often layered on top of each other! Sure, it is more latency sensitive than an overnight data warehouse operation, but that's hardly an impressive responsiveness threshold.

But I was too harsh. I hadn't internalized what shifting from the 50th percentile to the 99.9th really meant. The value isn't representative of what the typical user will see - it is representative of the worst you can stomach. In effect, it loses its marketing value - and marketing is how a lot of SLAs are used in the real world.

The Dynamo paper backs this up. Figure 4 shows both the average and the 99.9th percentile for both reads and writes. Average reads ran around 15ms and writes around 25ms; the 99.9th percentiles were in the 150ms and 250ms neighborhoods, respectively. This all makes much more sense, especially when dealing with disk drives.

In the end, I think you need more than one datapoint in order to make an effective characterization. Amazon clearly wouldn't be happy if everybody was seeing 150ms latencies - though they can stomach it if it is literally a one-in-a-thousand occurrence.

Maybe SLAs should be expressed as 4-tuples of the 25th/50th/75th/99.9th percentiles. I've developed benchmarks that way and felt it helped keep the subsequent optimizations honest.

Even a 90/99.9 pair is mostly what you need to keep a cap on the outliers while still getting a feel for what somebody is likely to see. A rough sketch of that kind of reporting follows.
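Here's what it could look like in Python (the percentile helper and the sample run are made up for illustration; nothing here comes from the paper):

import math

def percentile(samples, p):
    """Nearest-rank p-th percentile (p in 0-100) of a list of latencies."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

def sla_summary(latencies_ms, points=(25, 50, 75, 99.9)):
    """Summarize one benchmark run as a tuple of percentiles."""
    return tuple(percentile(latencies_ms, p) for p in points)

# Hypothetical run: mostly 11-18ms with a few slow outliers.
run = [12, 14, 15, 11, 13, 16, 18, 17, 15, 14] * 100 + [150, 180, 220]
print("25/50/75/99.9:", sla_summary(run))
print("90/99.9      :", sla_summary(run, points=(90, 99.9)))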