Friday, February 15, 2008

Plan For Failure

Vista "enhancements" include removing the ability to do repair installs. Screw Windows up a little too badly, and your only option is to reformat and reinstall.

RIM has an undisclosed problem with its servers in Canada, and suddenly every BlackBerry everywhere goes offline.

Congress starts ramping up surveillance and blanket data retention, but never seems to worry about the fact that those same tools are equally useful for criminals.

What do these three disparate events all have in common? Simple. In each case, the design was built around what happens when things go right, not wrong. All three display a horrific lack of preemptive failure analysis.

Failure analysis is taught in more established professions, such as mechanical or civil engineering. In these professions, where a screw-up can frequently mean people die, worrying about what happens when - not if - something breaks is beaten into students until they think about it the way a deep sea diver thinks about his air supply.

When a civil engineer designs a bridge, he can easily end up putting in thousands of pieces. Some pieces, when they fail, are rather unimportant. If the dedication plaque rusts or falls off, a donor may be upset, but the operation of the bridge isn't compromised. On the other hand, if a rivet or weld holding a support in place cracks, then the engineer who signed off on the design is going to be very interested in what will happen. Will the bridge hold for a year? Six months? A day?

Every part has an MTBF (mean time between failures). Knowing what will happen when a part fails is every bit as important as knowing when it is likely to fail. Often, an early analysis can find hidden critical dependencies that can be fixed or mitigated with simple design changes.

Take Vista's removal of repair installs. Strictly speaking, removing this feature didn't add any failure modes. Unlike a new driver or filesystem, it didn't add any new ways for an existing Windows system to break. What it does is ensure that once a failure beyond a certain threshold does happen, the impact goes from being recoverable to being a death sentence for that copy of Windows. Without adding any new failure modes, the number of critical failures just went up.

Now if you ask the people who put these systems together, I highly doubt that they intended for these systems to fail. This seems obvious... But it's also the problem.

Every system out there will have a failure sooner or later. Let's be fair to Microsoft and give them a point in their favor. Every BlackBerry sends its data through RIM servers, despite having a perfectly good data connection from the cell provider. This adds a wonderful single point of failure. By contrast, Microsoft-based smart phones don't need any such assistance. They're perfectly capable of talking on their own, without an extra translator.

Microsoft could take their entire infrastructure offline, and the phones wouldn't care. By keeping their own servers out of the data path, they've reduced the number of failure modes of Windows Mobile phones out in the wild.

If we programmers and IT guys want to be taken seriously, we absolutely have to start planning for failure. Throwing redundant servers at problems reduces the likelihood of failure, but doesn't reduce it to zero. RAID can protect you against a single hard drive failure, but most configurations won't survive several drives failing at once.
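To put rough numbers on that, here is a minimal Python sketch. It assumes failures are independent, and the probabilities are made up purely for illustration; the function name and the "shared component" parameter are hypothetical, not taken from any real tool.

```python
def outage_probability(p_server, n_servers, p_shared=0.0):
    """Chance the service is down, assuming all failures are independent.

    p_server  -- chance that any one redundant server is down (assumed value)
    n_servers -- number of redundant servers
    p_shared  -- chance that something every server depends on is down,
                 such as a common power feed (assumed value)
    """
    # Redundancy only fails when every server is down at once.
    all_servers_down = p_server ** n_servers
    # The service is up only if the shared part AND at least one server are up.
    return 1.0 - (1.0 - p_shared) * (1.0 - all_servers_down)


if __name__ == "__main__":
    # Two servers, each down 1% of the time, nothing shared.
    print(outage_probability(0.01, 2))
    # The same pair hanging off one component that is down 5% of the time.
    print(outage_probability(0.01, 2, p_shared=0.05))
```

With these invented numbers, the redundant pair alone is down about 0.01% of the time, but once both servers depend on a single shared part, the total climbs to roughly 5%. The second server buys almost nothing; the shared part dominates.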

We have to start asking ourselves, with each and every component we build or install, what will happen when this system breaks? That's how you notice things like a pair of high-end servers both plugged into the same $4.95 ValuePak power strip. That's how you put in exception handlers that, when the exception that can't possibly happen does happen, at least ensure the program goes down gracefully instead of exploding with a corrupted database.
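As one possible illustration of that last point, here is a small Python sketch using the standard library's sqlite3 module. The table names, the record_order function, and the in-memory database are all hypothetical, chosen only to show the idea: wrap related writes in one transaction so that an unexpected exception rolls everything back instead of leaving half-written data behind.

```python
import sqlite3
import sys


def make_db():
    """Build a throwaway in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stock (item TEXT PRIMARY KEY, on_hand INTEGER)")
    conn.execute("CREATE TABLE orders (item TEXT, quantity INTEGER)")
    conn.execute("INSERT INTO stock VALUES ('widget', 10)")
    conn.commit()
    return conn


def record_order(conn, item, quantity):
    """Record an order and decrement stock as one unit, or not at all."""
    try:
        # "with conn" commits if the block finishes and rolls back if it raises.
        with conn:
            conn.execute(
                "INSERT INTO orders (item, quantity) VALUES (?, ?)",
                (item, quantity),
            )
            cur = conn.execute(
                "UPDATE stock SET on_hand = on_hand - ? WHERE item = ?",
                (quantity, item),
            )
            if cur.rowcount != 1:
                raise LookupError(f"unknown item: {item!r}")
    except Exception as exc:
        # The case that "can't happen": report it, leave the data untouched.
        print(f"order not recorded: {exc}", file=sys.stderr)
        return False
    return True


if __name__ == "__main__":
    db = make_db()
    record_order(db, "widget", 2)   # succeeds, both tables updated
    record_order(db, "gadget", 1)   # fails, the orphan order row is rolled back
```

The handler doesn't fix anything. All it does is make sure the failure leaves the database in the state it was in before the order started, which is the difference between an error message and a corrupted database.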

That's how we can start building systems where a single, simple, stupid failure doesn't turn into a headline-generating, career-limiting fiasco. Then maybe those civil and ME guys will stop snickering whenever one of us calls himself a software "engineer".
