I’m writing this blog while watching the first game of the Montreal Canadiens’ playoff hopes. There was a time when I was growing up when they were virtually undefeatable. We can only hope now. Yes, I’m a fan. Is there a relationship? Probably not, but who knows? I'm also still snickering at the maintenance outage that the hosting site for this blog had on Thursday morning. Does anyone see the irony in that?
To continue where I was last week, improving availability is typically an exponential cost function, where each 9 of availability costs substantially more than all the previous 9’s combined. So at some point you get to diminishing returns where it costs you more than it’s worth.
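To put rough numbers on the 9's, here is a minimal Python sketch. The downtime arithmetic is exact; the cost side is purely illustrative. The function `cost_of_nine` and its `base` and `growth` parameters are hypothetical names, and the growth factor of 3 is an assumption chosen only so that each 9 costs more than all the previous ones combined. Your real cost curve will differ.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes in an average year


def downtime_minutes_per_year(nines: int) -> float:
    """Downtime budget per year at `nines` nines of availability (3 -> 99.9%)."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


def cost_of_nine(k: int, base: float = 1.0, growth: float = 3.0) -> float:
    """Hypothetical cost of adding the k-th nine.

    Assumption: cost grows geometrically. Any growth factor above 2
    makes each nine cost more than all the previous nines combined.
    """
    return base * growth ** (k - 1)


if __name__ == "__main__":
    for n in range(1, 7):
        previous = sum(cost_of_nine(k) for k in range(1, n))
        print(
            f"{n} nine(s): {downtime_minutes_per_year(n):10.2f} min/year down, "
            f"cost of this nine = {cost_of_nine(n):6.1f}, "
            f"all previous nines = {previous:6.1f}"
        )
```

The units of cost here are arbitrary; the point is only the shape of the curve, not the values.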
But what if you looked at the problem from the other direction, assuming that the system will never go down, and worked backwards? Wouldn’t that be unattainable and cost an infinite amount of money? If you come at it the wrong way, starting from an unreliable system and trying to make it perfect, you’ll never get there – yes, that’s a debatable point, so go ahead and argue with me. So, start from the assumption that outages are unacceptable right from inception. There’s a lot you’ll need and a lot you’ll have to invest in, but it’s actually a quantifiable cost that you can get to using traditional project budgeting techniques. Traffic routing, reliable platforms, and sophisticated version control are all elements of it. Infrastructure is a huge part of the cost, as is cultural change in the operations and development groups – we’ll go there later – but is it worth it?
Here’s another cost graph where the cost of indestructibility is added to last week’s picture. Strangely, it’s a straight line. It makes no difference in the cost whether you run an indestructible solution 7x24x365.25 or 20 hours a day. So again, why bother?
In many situations these days, installation can take days, not hours. Try renormalizing a multi-terabyte-size database in your normal outage window. You can’t. It doesn’t matter whether you are at 99.9999% or 99.99%. Indestructibility means you have to have the ability to perform the renormalization while the system is up – a daunting task, but possible.
What the cost curve shows is what I call the Indestructibility Intersect Point, let’s say iIP to coin an acronym. It’s where the availability curve and the indestructibility curve meet. If your indestructibility investment is less than your outage cost, which it easily can be (again, you know who you are out there), why bother chasing the 9’s curve? Sometimes the iIP will be above the outage cost. That’s when you don’t bother. So the question for you to think about – who knew there would be homework in a blog – is this: Do you know what your three costs are (chasing the 9’s, indestructibility, and outages) so that you can decide what to do?
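The comparison can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function names, the geometric cost model for the 9's (each 9 costing three times the previous one), and the cost figures in the usage example. The sketch also treats a tie between indestructibility cost and outage cost as "don't bother", which the discussion above does not settle either way.

```python
from typing import Optional


def cumulative_nines_cost(nines: int, base: float = 1.0, growth: float = 3.0) -> float:
    """Hypothetical total spent to reach `nines` nines (geometric growth assumed)."""
    return sum(base * growth ** (k - 1) for k in range(1, nines + 1))


def intersect_point(
    indestructibility_cost: float,
    max_nines: int = 9,
    base: float = 1.0,
    growth: float = 3.0,
) -> Optional[int]:
    """First number of nines at which chasing the 9's costs at least as much
    as the flat indestructibility investment, or None if it never does
    within `max_nines`."""
    for n in range(1, max_nines + 1):
        if cumulative_nines_cost(n, base, growth) >= indestructibility_cost:
            return n
    return None


def recommend(outage_cost: float, indestructibility_cost: float) -> str:
    """Apply the rule of thumb: invest only if it costs less than your outages."""
    if indestructibility_cost < outage_cost:
        return "invest in indestructibility"
    return "don't bother"


if __name__ == "__main__":
    # Made-up figures in arbitrary cost units.
    print(intersect_point(indestructibility_cost=100.0))
    print(recommend(outage_cost=500.0, indestructibility_cost=100.0))
    print(recommend(outage_cost=50.0, indestructibility_cost=100.0))
```

The hard part is not the code but knowing the three inputs: what the 9's are costing you, what indestructibility would cost, and what an outage costs.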
What’s coming next? Well, we’re starting to get past the introduction, so I’m going to have to write about details soon. Stay tuned and give me feedback.
Copyright © 2009 Randall S. Becker.

 
