Sunday, March 29, 2009

Perceptions on Destruction

So, my system blew up. Ok, not really, but what does that mean to you? It means a lot of different things to me. Here are a few things that have gone wrong with my computers recently:

  • A monitor refused to power on at start-up.
  • A patch came in that caused a test system to refuse to boot.
  • A software license key got corrupted – you saw this one last week.
  • An external battery melted into a puddle of smoking silicon, plastic, cadmium, and goo.
  • A laptop wouldn’t even power up, but I had complete backups.

There are more, but I think you get the point: I’ve had my share of interesting experiences. So which one caused me the most grief (and embarrassment)? Well, I have many monitors, so losing one is never an issue. External batteries are interesting devices, but losing one only cost me some time on the plane, which I used for some much-needed rest anyway. The software vendor with the license key issue responded quickly, so, while I lost six hours, it was overnight and I should have been sleeping. So what bothers me the most?

From a convenience standpoint, there’s no question that losing my laptop was the worst. It took me six weeks to get back to full speed. The laptop was at the end of its life, and because I had a backup I lost virtually nothing. However, I couldn’t buy a comparable laptop that had the version of the operating system and productivity software I wanted on it, so reconstructing my environment was painful. But the key part was that I didn’t slow down at all for my customers. Not a single document, presentation revision, e-mail, or anything else of importance was lost.

From an operational standpoint, my company took a hit because one of our servers refused to boot. We had to run without our core document repository for almost a day. There was no issue in the hardware though. The problem was that an operating system patch came down automatically, but the BIOS couldn’t handle it. The hardware vendor issued a BIOS patch after the operating system patch came down and the problem was resolved. So who’s to blame?

In a word: Me. It’s always my fault. Hey, when you run a company, it’s your fault. Take the blame and move on. I’ve since disabled automatic patch installation on all my servers, and patches now go through quarantine prior to installation on business-critical systems. The operating system vendor has to take some of the blame for not testing their patch adequately on very common BIOS releases where their operating system is installed as OEM software. And it’s also the hardware vendor’s fault for not warning their customers in advance of the issue. I can’t believe that we were the first company to encounter the problem – although, as some of you readers already know, it’s happened before.
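
For readers who like to see a policy written down, here is a minimal sketch of a quarantine gate in Python. Everything in it is an assumption made for illustration: the seven-day soak period, the `Patch` record, and the idea that a test system reports a simple pass or fail. It is not the procedure my company actually runs, just one way the rule "no patch reaches a business-critical server until it has survived quarantine" could be expressed.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical policy value, chosen only for illustration.
QUARANTINE_DAYS = 7


@dataclass
class Patch:
    name: str
    received: date               # day the patch arrived from the vendor
    passed_on_test_system: bool  # did a test server boot and run with it?


def approved_for_production(patch: Patch, today: date) -> bool:
    """Allow a patch onto business-critical servers only after it has
    sat in quarantine long enough AND a test system has survived it."""
    soaked = today - patch.received >= timedelta(days=QUARANTINE_DAYS)
    return soaked and patch.passed_on_test_system
```

The soak period is there to give the hardware and operating system vendors time to discover and publish conflicts like the BIOS one described above. The test-system flag is what catches the problems nobody has published yet.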

So what did I learn? There was egg on my face, of course. My staff was pretty upset that the server was down, even if only for a short period. Fortunately, my customers didn’t notice, because it was an internal server. But mostly I learned that it’s not the hardware. The physical machine had no problems, although there were lots of amber warning lights flashing at me. It’s not the individual software components, because they all worked to specification, according to the vendors anyway. It was the inter-relationship between changing software components. And you know what? That’s usually what burns you. A few years ago, I gave a keynote where I asked the audience to choose which represented the most likely risk to their organization from a set of disaster scenarios: an asteroid, nuclear war, floods, mould, and a picture of my (then toddler) son at a keyboard. I have to give the group (one of the HP NonStop user groups) credit for getting the answer right – my son. Changing software represents something far different from a disaster scenario. It is a quantifiable, known, and expected risk. Each time you change your system, you are putting your company, your customers, and your stakeholders at risk, because something may break. On the other side of the coin, not changing your system also puts the same group at risk, because you may no longer be competitive. So what do you do?

It’s simply not an option to stand still. Could we surf the web if we all were using bicycle-chain driven computers or upgraded looms? I don’t know about you, but I can’t blog on a loom. My cats can blog on a rug, but that’s a different story and very messy. So change is something we want and have to embrace. Dealing with the risk-reward of moving forward is pervasive in technology. We can’t ignore it, so we have to change and put in new releases. The questions are where to find a balance and how to do it safely.

Stay tuned for upcoming entries where I’ll talk about this balance. The next blog will go into a comparison of different levels of expectation of reliability.

Copyright © 2009 Randall S. Becker.

Thursday, March 26, 2009

Running with a Security Blanket

I remember, back when I was two, I had a little blue blanket. It had a blue shiny border and made me feel warm, comfortable, and secure. I cried when my mom washed it because somehow it was different when it came back. It smelled nice and clean, but it just wasn’t the same somehow. By now, you’re probably wondering why I’m bringing this up. So am I, actually, but I’ll get there.

This blog is coming at you directly from the HP NonStop Security SIG in Canada. A number of vendors showed up today to present their capabilities and perspectives. It was a pretty good event. Topics included: PCI Compliance, Sarbanes-Oxley (SOX) Compliance, Kerberos, single logon, various protocols, emulations, integration points, auditing and reporting. OK, so why am I rambling on about security on an Indestructible Computing blog?

Security is rarely considered in the Indestructible Computing domain. Yet security breaches definitely contribute to outages, particularly when the criminal is bent on malicious damage rather than data access. Fortunately, of all the breaches, this kind is not that common. A bigger concern these days for security is protecting data from prying eyes. But come to think of it, if you get audited and get shut down because you’re too vulnerable, that’s a pretty big problem for your customers.

But if you look at indestructibility, security and authentication can come into play in other ways that are not obvious, but are annoyingly disruptive. Suppose a customer logs onto your banking application using their card number and password and then changes their password. If everything goes right, all the servers happily running in the data centre pick up the credential changes and are able to service the customer’s requests for balances, transfers, and other inquiries. But what if one system is down for maintenance? It happens. The password update isn’t picked up by that system immediately, but that should be OK. The user gets a note that some part of the system they don’t care about isn’t available, and they carry on happily. An hour later, the system is back up and the user logs in again. The user now wants to use resources on the system that was down, but the batch job that updates passwords from the master password server hasn’t run yet. Now you have an unhappy user who has to call customer support. That costs you money for the support agent and credibility with the customer. So the important part of this equation is that password and credential management must have no latency. It may even be that the servers that process your credentials are right up there with the more critical parts of your service offering, because they are customer facing.
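
The scenario above can be reproduced in a few lines of Python. This is only a sketch under stated assumptions: `MasterCredentials` and `AppServer` are hypothetical stand-ins, the master exposes a simple change counter, and passwords are kept in plain text purely to keep the example short (a real system would store salted hashes). It contrasts a server that relies only on the batch job with one that checks whether its copy is stale at login time.

```python
class MasterCredentials:
    """Hypothetical stand-in for the master password server."""

    def __init__(self):
        self._passwords = {}
        self.version = 0  # bumped on every credential change

    def set_password(self, user, password):
        self._passwords[user] = password
        self.version += 1

    def snapshot(self):
        return dict(self._passwords), self.version


class AppServer:
    """Hypothetical application server holding a local credential copy."""

    def __init__(self, master):
        self.master = master
        self.cache, self.cache_version = master.snapshot()

    def batch_sync(self):
        # What the scheduled batch job does when it eventually runs.
        self.cache, self.cache_version = self.master.snapshot()

    def login_batch_only(self, user, password):
        # Trusts the local copy, however old it is.
        return self.cache.get(user) == password

    def login_checked(self, user, password):
        # Refresh on demand if the master has changed since the last sync.
        if self.cache_version != self.master.version:
            self.batch_sync()
        return self.cache.get(user) == password


master = MasterCredentials()
master.set_password("alice", "old-secret")
server = AppServer(master)                 # server then goes down for maintenance
master.set_password("alice", "new-secret")  # customer changes password meanwhile

stale_result = server.login_batch_only("alice", "new-secret")  # customer is rejected
fresh_result = server.login_checked("alice", "new-secret")     # customer gets in
```

The version check costs one extra round trip to the master per login, which is the price of removing the latency. It also makes the point in the paragraph above concrete: once logins depend on the master, the master becomes a customer-facing component.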

But that’s not all. Current audit requirements mean that so much logging is going on that companies need increasing amounts of disk every year. We’re even hearing that we’re going to have to eventually keep records of all traffic going through our routers. Who makes this stuff up, disk drive manufacturers? We’re projecting a need for terabytes of storage just for security and audit compliance. And, the rub is that if you run out of disk, your application cannot process transactions or even inquiries. You actually have to shut down until you can start logging again. Now that is not indestructible, is it?
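
The arithmetic behind that worry is simple enough to automate. Here is a minimal sketch in Python, assuming you know roughly how many bytes of audit log you write per day. The function names, the 30-day warning threshold, and the path you pass in are all assumptions for illustration. `shutil.disk_usage` is the standard library call that reports free space on a volume.

```python
import shutil


def days_until_full(free_bytes: int, daily_log_bytes: int) -> float:
    """Estimate how many days of logging fit in the remaining space."""
    if daily_log_bytes <= 0:
        return float("inf")  # nothing is being written, so it never fills
    return free_bytes / daily_log_bytes


def audit_volume_status(path: str, daily_log_bytes: int,
                        warn_days: float = 30.0) -> str:
    """Return "WARN" if the volume holding `path` is projected to fill
    within `warn_days`, otherwise "OK"."""
    free = shutil.disk_usage(path).free
    remaining = days_until_full(free, daily_log_bytes)
    return "WARN" if remaining < warn_days else "OK"
```

For example, with 2 TB free and 50 GB of new audit data a day, `days_until_full(2 * 10**12, 50 * 10**9)` works out to 40 days. That sounds like a lot until you remember how long it takes to get a purchase order for more disk approved.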

Random rant: In an effort to keep things politically correct to reduce HR vulnerabilities and access to bad sites, some companies are putting in activity loggers as part of their security and audit infrastructures. These are going on your laptops and workstations! The concept is great for catching slackers and indiscriminate porn surfers. The problem is that some of these tools can also capture credit card information, passwords, and other identifying information. Who is securing the HR department? What if their tracking data is hacked?

Security is like a warm blanket. You can wrap your systems up in it and feel all nice and comfortable. But some hackers might want to take away your blanket or poke at you through it from places you can’t see. More importantly, you can’t hide from your customers under it. And unlike some superheroes' capes, blankets are not indestructible.

Copyright © 2009 Randall S. Becker.

Saturday, March 21, 2009

Sidebar - Software Licenses Impacting Indestructibility?

This is a special news flash blog entry from real life, so do not interrupt your set. We'll return to our regularly scheduled blog shortly.

My company has established some pretty good controls and redundancies for handling a variety of scenarios relating to software and hardware failure. For example, my book is backed up on a RAID drive on a server and has a redundant copy on my laptop in case the ceiling falls in on my server. The server is under a main support beam in the building and nowhere near a water supply, so it's even somewhat protected from an earthquake. While flooding is a possibility, the rest of Toronto would be lost first, so I think it's an acceptable risk. Anyway, to the point. Tonight, a Friday, all of a sudden, after 14 months of working properly, one of our primary third-party software publishing products stopped working because it hit a "genuine version violation". I'll leave you to guess who made that product. I can't install the product on another machine, because it has already hit the violation and could not be activated there. I should mention that we have a very strong anti-piracy policy and I've got the original software media on my desk beside me. So now we’re down because we’re unable to use a key resource of our company. The response from the vendor is that the situation will be resolved within 1 business day, which puts it sometime at the end of Monday. The rating assigned by the vendor was “Minimum business impact”. Ha!

So how does this relate to indestructibility? Well, it shouldn’t, but it does. Because a key service is no longer available, our business is interrupted. It wasn’t because of a process issue or a procedure issue in our company. Nor was it a hardware or software failure. It was something the vendor of our document preparation software did or did not do properly, their assessment of the severity of the issue, and their response times – all of which are outside our control.

Vulnerabilities in your ability to deal with failures come from all over the place. Sometimes they’re in your control. Sometimes, like tonight, they’re not. It’s extremely frustrating and, in this case, embarrassing, mostly because of the unplanned and unacceptable outage for an unreasonable amount of time.

More to come on perceptions soon. Come to think of it, this is partly a perception issue, isn’t it? A difference in the perceived importance of a service from a client’s point of view compared with a vendor’s.

Copyright © 2009 Randall S. Becker.

Tuesday, March 17, 2009

Welcome to the Indestructible Computing Blog

The Indestructible Computing Blog is my attempt at framing what Indestructible Computing is and is not, and establishing a solid dialog on the subject in the hopes of raising the expectations we have for our friend the computer. In our daily experience with computers, we see spam, viruses, slow web pages, crashes, spinning clocks, and other minor annoyances that really don’t help our perception of what computers can do. What we don’t see is the infrastructure that quietly runs in the background, making sure that our money is moving around correctly without prying eyes, running our power plants, giving us the security of knowing that we can pick up the phone and call 9-1-1, and letting us go to the grocery store and pay for food with confidence that the computers will be up, even if our credit is maxed-out. But why are the two types of experiences so different?

Much comes down to expectation. We expect our computers at home to misbehave, but we are intolerant of retailers who lose our online shopping carts after an hour of our latest buying spree. After all, the commodity computers we purchase at our large electronic stores are throw-away, right? But how are they different from the commodity computers our banks and phone companies use? They’re not, actually. Advances in the quality of hardware have benefitted everyone alike. So why do we expect our computers at home to stop every so often and become enraged when our banks’ web sites are down for a few scheduled moments?

I remember one incident at a border crossing, where an immigration agent asked me what I did. My response was that I help companies design systems that will run for twenty to thirty years. She was incensed at the idea and told me that that was impossible. That was an epiphany for me. In a few words, almost two decades of frustration at trying to convey the concept of indestructibility was explained and left me feeling like a pile of broken glass. Perception of the unreliability of computers has become so ingrained in our culture that people simply don’t believe systems could be built to withstand disasters, but only when those systems are visible. Infrastructure, however, isn’t perceived to be a “computer”, so it, whatever it is, supports our society and had better always be there.

Stay tuned for the next entry where I’ll explore this perception further.

Copyright © 2009 Randall S. Becker.