Thursday, April 16, 2009

The Indestructible Intersect Point

I’m writing this blog while watching the first game of the Montreal Canadiens’ playoff hopes. There was a time when I was growing up when they were virtually undefeatable. We can only hope now. Yes, I’m a fan. Is there a relationship? Probably not, but who knows? I'm also still snickering at the maintenance outage that the hosting site for this blog had on Thursday morning. Does anyone see the irony in that?

To continue where I was last week, improving availability is typically an exponential cost function, where each 9 of availability costs substantially more than all the previous 9’s combined. So at some point you get to diminishing returns where it costs you more than it’s worth.

But what if you looked at the problem from the other direction, specifically assuming that the system will never go down, and work backwards? Wouldn’t that be unattainable and cost an infinite amount of money? If you come at it from the wrong way, starting from an unreliable system and trying to make it perfect, you’ll never get there – yes, that’s a debatable point, so go ahead and argue with me. So, start from the assumption that outages are unacceptable right from inception. There’s lots of stuff you’ll need and have to invest in, but it’s actually a quantifiable cost that you can get to using traditional project budgeting techniques. Traffic routing, reliable platforms, and sophisticated version control are all elements of it. Infrastructure is a huge part of the cost, as is cultural change in the operations and development groups – we’ll go there later – but is it worth it?

Here’s another cost graph where the cost of indestructibility is added to last week’s picture. Strangely, it’s a straight line. It makes no difference in the cost whether you run an indestructible solution 7x24x365.25 or 20 hours a day. So again, why bother?

In many situations these days, installation can take days, not hours. Try renormalizing a multi-terabyte-size database in your normal outage window. You can’t. It doesn’t matter whether you are at 99.9999% or 99.99%. Indestructibility means you have to have the ability to perform the renormalization while the system is up – a daunting task, but possible.

What the cost curve shows is what I call the Indestructibility Intersect Point, let’s say iIP to coin an acronym. It’s where the availability curve and the indestructibility curve meet. If your indestructibility investment is less than your outage cost, which it easily can be (again, you know who you are out there), why bother chasing the 9’s curve? Sometimes the iIP will be above the outage cost. That’s when you don’t bother. So the question for you to think about – who knew there would be homework in a blog – is this: Do you know what your three costs are (your outages, your next 9, and indestructibility) so that you can decide what to do?
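To make the comparison concrete, here is a minimal sketch in Python. It assumes a simple model that the post does not spell out: the cumulative cost of reaching a given number of 9's grows by a fixed factor per 9, while the cost of indestructibility is a flat line. The function names, the growth factor, and every figure are hypothetical, chosen only to illustrate the shape of the decision.

```python
import math


def nines_cost(nines, base_cost, growth):
    """Assumed cumulative cost of reaching `nines` nines of availability.

    Hypothetical model: cost grows by a factor of `growth` per nine. With
    growth > 2, each nine costs more than all the previous nines combined.
    """
    return base_cost * growth ** nines


def intersect_point(indestructible_cost, base_cost, growth):
    """Number of nines at which the assumed curve meets the flat line (the iIP)."""
    return math.log(indestructible_cost / base_cost) / math.log(growth)


def choose_path(indestructible_cost, outage_cost):
    """Apply the rule from the post: invest only if it costs less than the outages."""
    if indestructible_cost < outage_cost:
        return "indestructible"
    return "don't bother"


if __name__ == "__main__":
    # Illustrative figures only, in arbitrary currency units.
    base, growth = 1_000.0, 10.0
    flat = 10_000_000.0
    print("iIP (nines):", intersect_point(flat, base, growth))
    print("decision:", choose_path(flat, outage_cost=50_000_000.0))
```

A real organization's curve will not be this tidy, but the three inputs are the same ones the homework question asks about.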

What’s coming next? Well, we’re starting to get past the introduction, so I’m going to have to write about details soon. Stay tuned and give me feedback.

Copyright © 2009 Randall S. Becker.

Thursday, April 9, 2009

Indestructibility vs. Availability

An interesting perspective came from a discussion I had recently with Richard Buckle of the Real Time View (http://itug-connection.blogspot.com/). We talked about a major difference in perspective between highly/continuously available systems and indestructible systems. In the case of indestructibility, people take the position that the system will always be running, forever, right from the beginning. It’s the core assumption. How you build your infrastructure, software, and platforms needs to keep that firmly in mind right from the initial concept. With highly available systems, people start with a trade-off of what is good enough vs. cost, whether that’s four 9’s, five 9’s, or better. The trade-offs made during a project come down to how many nines people are willing to pay for. So it gives marketing people a really interesting pitch:

We can give you a brand new system that has five 9’s of availability, and it will only cost you ___.

Sure they can, but who pays for the other changes? You know, the small stuff, like retraining everyone, change management, business process reengineering (BPR), and testing cycles – all the “your mileage may vary” costs that somehow are always much bigger than anyone expects and often larger than the technology outlay to get you that extra 9 in the first place. At some point, and it’s very personal for your organization, the cost of indestructibility is actually less than chasing the exponential 9’s curve when you start from a system that is fundamentally fragile.

Now, suppose you’ve got a system that is happily running along 99.99% of the time, and somebody figured out that every minute of outage costs your company twenty million dollars in penalties (you know who you are out there), and you’ve had outages. Or worse, suppose your outages are larger than that, and you cross the critical fifteen-minutes-down-and-lose-your-charter line. In order to add another 9 to your availability numbers, you’re going to have to rework your environment, maybe change platforms, change your processes, rewrite your software, build new deployment technology, get user acceptance testing signoffs, and worse, try to find funding in the organization to make all that happen. That’s pretty daunting. The fear of having to go through that for every nine is what led me down the indestructibility path in the first place. Organizational change is far harder than technological change, but that’s often what we have to do to add that elusive and expensive additional 9.
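The arithmetic behind that scenario can be sketched in a few lines of Python. This is only an illustration: the function names are made up, and it assumes the worst case where the system uses its entire downtime allowance in a year and every minute is penalized at the same rate.

```python
# Minutes in an average year, using the same 365.25-day year as elsewhere.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(nines):
    """Downtime allowance for a given number of nines (4 means 99.99%)."""
    return MINUTES_PER_YEAR * 10 ** -nines


def annual_outage_cost(nines, cost_per_minute):
    """Assumed worst-case penalty if the whole downtime allowance is used."""
    return downtime_minutes_per_year(nines) * cost_per_minute


if __name__ == "__main__":
    penalty = 20_000_000  # dollars per minute, from the scenario above
    for nines in (4, 5):
        minutes = downtime_minutes_per_year(nines)
        cost = annual_outage_cost(nines, penalty)
        print(f"{nines} nines: {minutes:.2f} min/year, ${cost:,.0f}")
```

At four 9's the allowance is about 53 minutes a year, which at twenty million dollars a minute is roughly a billion dollars of exposure. That is the number the cost of the next 9 has to be weighed against.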



To illustrate this point, here’s a sample graph of the risks/rewards of availability. Next time, I’ll talk more about this cost function and what it looks like in the indestructible world. You might be very surprised.

Copyright © 2009 Randall S. Becker.

Saturday, April 4, 2009

Comparison of Availabilities

I was recently asked a good question by one of the readers: “What is indestructible computing, and why should I care?” Here are a few common terms. What you should keep firmly in mind is that whatever aspect of a system you look at, the actual service level you experience is usually that of the weakest of your components. Guess which one is almost always the weakest? If you’ve been following the blog, you already know: software change.

General Purpose Computing

Well, you’re probably reading this blog from a general purpose environment. A workstation or laptop can be considered general purpose hardware. Your browser could probably be considered general purpose software. The combination of the two gives you a general purpose environment.

Highly Available Systems

These systems are available most of the time – generally 99.99% of the time, or slightly under 5 minutes of unplanned or planned downtime a month. Banking systems are typical of these. Fitting maintenance into even a five minute window is difficult, particularly when you’re upgrading disks or restructuring your Operational Data Store (ODS).

Continuously Available Systems

These systems are available virtually all of the time – generally 99.999% of the time (about 30 seconds of downtime a month). Extensive use of independent components allows these systems to operate virtually without any unplanned outages. Planned outages do occur for upgrades, but the window for these outages is very small. There’s a lot of confusion between Highly Available and Continuously Available systems. The lines are pretty blurry, and I won’t really differentiate between them, much. That there is even a distinction is arguable.
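For anyone who wants to check the downtime figures quoted for these two categories, here is a small Python sketch. The function name and the 30-day month are assumptions made for the example.

```python
def downtime_seconds(availability_percent, period_days=30):
    """Seconds of downtime allowed in a period at a given availability."""
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (100 - availability_percent) / 100


if __name__ == "__main__":
    for pct in (99.99, 99.999):
        secs = downtime_seconds(pct)
        print(f"{pct}%: {secs:.1f} s/month ({secs / 60:.2f} min)")
```

For a 30-day month this works out to roughly 4.3 minutes at 99.99% and roughly 26 seconds at 99.999%, in line with the "slightly under 5 minutes" and "about 30 seconds" quoted above.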

Critical Systems

These systems include some of the obvious life-critical systems: flight control systems; rockets; many health monitor devices. Systems like this do not have the same level of long-term availability that continuously available systems have, but during their duty cycle, no outage is permitted at all. Fortunately, no changes are generally permitted while the systems are up. How many launches were delayed because of sensor or software issues?

Long-Life Systems

In long-life systems, reliability is the number one priority. Unscheduled maintenance is usually impossible or cost-prohibitive. Scheduled maintenance is possible but not desirable, and usually involves only software components. During maintenance, rigorous testing is done to ensure that the system will function reliably when back online. Communication satellites and the Mars Explorers fall into this category. Even then, subtle defects, like miles vs. kilometres per hour in a calculation, can cause disastrous failures.

Indestructible Systems

A truly indestructible system builds on the best of all of these systems. The systems are expected to be long-life, yet dynamic. Change is not only possible, but expected. Yet, there are no unplanned outages and no planned outages. Not only small components, but major components like data centers can go offline without a perceived outage or noticeable reduction in service levels. Maintenance is done while the system is up.

And I don’t blame anyone for thinking indestructibility is unattainable. It’s very hard to get right and even then, it’s always possible that something will go wrong. In future posts I’ll go into what it takes to make this work. Hopefully you’ll see that indestructible systems are practical in the real world and understand what it takes to make them work for you.

The next post will go into the starting points of view for building these systems and how money gets wrapped up in it.

Copyright © 2009 Randall S. Becker.

Wednesday, April 1, 2009

Blog Schedule

Hi Everyone. I've already had a few requests for topics, subjects, and information; so don't worry, I'll get there. One important topic in a blog is scheduling. I'm going to try to post a new blog about once a week (at least), usually some time on the weekend. So don't get too far behind. And, if I get too far behind, don't hesitate to nag me.

Indestructibility is an important topic. It starts from the fundamental position that systems should always be available instead of looking at how many 9's you can stick on the end of a marketing statistic. Let's keep our systems (and discussions) running forever. It's not only about hardware or operating systems, although that's part of it. It's more about people, processes, and how you manage your critical business applications. And, while I have my preferred platforms, I'm going to try to keep things fairly platform-neutral.