Tolerating failure: From happy accidents to serious screwups … Time to look at getting it wrong, er, correctly
Let’s talk procedures. Plus: Are you dealing with errors in a way that leaves room for people to own up to them?
This correspondent has a confession to make: I’m not perfect and sometimes things don’t go as I hoped.
I have made quite a few mistakes during the many years I’ve spent working with technology. What’s more, I see this as a good thing, and I am reassured by the fact that the late businessman, author and company troubleshooter Sir John Harvey-Jones has been quoted as saying “People who don’t make mistakes are no bloody good to you at all.”
Any organisation that doesn’t change is an organisation that isn’t going to be around for much longer. If we sit still, everyone around us will innovate and we will lose.
But there’s a flip side to this, which is that if we change something, we risk something going wrong. Sir John had a line for that too: “The only companies that innovate are those who believe that innovation is vital for their future.”
Before we go on, let’s understand what we mean by something going wrong, because it might not always mean something is a complete failure; it simply means it hasn’t happened as hoped or as designed. If we inadvertently use orange juice instead of lemon juice when baking a cake, this is a mistake but we may well end up with something enjoyable.
Cornflakes were invented when Will Kellogg accidentally left some wheat boiling on the stove. Viagra came about as an unexpected side-effect of an angina treatment being developed by Pfizer. The microwave oven came from a radar research scientist finding that the kit had melted his chocolate bar. Unintended outcomes are by no means always bad.
This correspondent can think of any number of times that I have done things — or seen things done — sub-optimally, sometimes even fairly disastrously. An internet startup, back in the days when online “communities” were considered a new and fun thing, decided to host its interactive chat server not in its data centre but in its London office, which had a non-resilient internet connection … that turned up its toes a couple of days before Christmas and prompted a 100-mile drive to open up for the telco’s engineer.
A tech company made an error when deploying new desktop PCs and enabled a ransomware attack that deleted tens of thousands of files from the main server. A client decided to save a few hundred pounds by moving his SQL Server cluster from one data centre to another by himself, and ended up with a much bigger bill. A company deploying a massive new system allowed itself to be persuaded to install over 100 new physical servers despite its infrastructure being extensively virtualized.
The vast majority of these examples have a common factor: the decision to act in a particular way was made after some level of thought, discussion and consideration.
Debugging: How did you get there?
The internet startup’s decision to host locally was based on the fact that putting the server in the data centre involved flying across the Atlantic or engaging an expensive US-based consultant (these were the days before cloud computing or even server virtualization were a glimmer of an idea).
The SQL Server issue was down to the client unplugging everything in the old data centre and then reconnecting it wrongly in the new one. The overly physical server setup was in fact a reluctant choice, but was grudgingly made because the vendor of the software being procured was adamant that they would not support it on a virtualized setup.
All of these examples were metaphorical orange cakes, though; yes, they didn’t work out as desired but the downside was modest and — most importantly — people learned something as a result. And it’s very uncommon that something we decide to do ends in abject failure; most of the time we are heading pretty much in the right direction, so if we pause to take stock — or a problem forces us to do so — the solution is usually a tweak rather than a wholesale rebuild.
So the startup decided to change … well, nothing. The telco engineer noted that the line had been provisioned poorly, yet it was still the first time in many months that there had been an issue, and the chat server wasn’t considered ultra-critical.
The guy who decided to do a self-service data centre move learned the hard way that this was a bad idea, but thanks to a Sunday afternoon call-out for the person who set up the system (me) the service was re-plumbed correctly and was up and running in time for the start of business Monday morning.
And although the company in the final example was frustrated by having a load of extra kit to manage and maintain, the service the physical machines hosted worked very well and was supported by the vendor. The “wrong” decision often doesn’t result in disaster, then.
When the procedure is the real failure
So, what about the example we have not yet returned to: the ransomware infection? Unlike the other examples, this came about thanks to a person making an error rather than through some tangible decision process. The PC deployment procedure included a step to install the anti-malware package on all machines, and the engineer doing the installation inadvertently skipped it. It was the worst outcome of the four, and yet the resulting education for the firm was greater than in the three other examples put together.
First, the company learned that although it had a procedure, it was insufficient. Yes, an engineer made a mistake, but the procedure did not include any element of a “second pair of eyes” to check his work — an omission that was quickly rectified.
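For the technically minded, the “second pair of eyes” idea can be sketched in a few lines of Python. This is only an illustration, not the firm’s actual procedure: the `Step` and `Deployment` classes, the step names and the engineers’ names are all made up for the example. The point is simply that a machine is not released until every step has been done by one person and verified by a different one.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One line of a (hypothetical) PC deployment checklist."""
    name: str
    done_by: Optional[str] = None     # engineer who performed the step
    checked_by: Optional[str] = None  # second person who verified it


@dataclass
class Deployment:
    machine: str
    steps: List[Step] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return the reasons this machine must not be released yet."""
        issues = []
        for step in self.steps:
            if step.done_by is None:
                issues.append(f"{step.name}: not done")
            elif step.checked_by is None:
                issues.append(f"{step.name}: not verified")
            elif step.checked_by == step.done_by:
                issues.append(f"{step.name}: verified by the same person")
        return issues

    def ready(self) -> bool:
        """A machine is ready only when no step has an open issue."""
        return not self.problems()


# Illustrative data only: the anti-malware step has been skipped.
pc = Deployment("PC-042", [
    Step("Install OS image", done_by="alice", checked_by="bob"),
    Step("Install anti-malware"),
])
print(pc.problems())  # ['Install anti-malware: not done']
```

A skipped step and a self-certified step are both reported, so the gap is caught before the PC reaches a user’s desk rather than after the malware does.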
Next, it was quickly realised that the attack was limited to a relatively small set of files (OK, it was tens of thousands, but the content of the entire file store ran into the millions of files), demonstrating that the rigour with which folder permissions had been limited under the “Principle of Least Access” had been worthwhile.
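To see why tight folder permissions limit the damage, here is a minimal Python sketch. It assumes a deliberately simplified model in which each folder lists the groups allowed to modify it; the folder paths, group names and the `writable_folders` helper are hypothetical and are not taken from the firm’s real setup. Ransomware running under a compromised account can only touch what that account can write to.

```python
from typing import Dict, Set


def writable_folders(acl: Dict[str, Set[str]], groups: Set[str]) -> Set[str]:
    """Folders that an account belonging to `groups` is allowed to modify."""
    return {folder for folder, allowed in acl.items() if allowed & groups}


# Hypothetical access control list: folder -> groups with write access.
acl = {
    "/finance": {"finance"},
    "/hr": {"hr"},
    "/projects": {"engineering", "finance"},
    "/public": {"all-staff"},
}

# Groups of the (hypothetical) account on the unprotected PC.
compromised = {"engineering", "all-staff"}

print(sorted(writable_folders(acl, compromised)))  # ['/projects', '/public']
```

Had every group been granted write access everywhere “to keep things simple”, the same function would have returned every folder on the server.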
The recovery of these tens of thousands of files took the best part of a couple of days, but in addition to giving concrete proof that the backup regime had worked superbly, it also allowed the firm to learn that in the event of an incident you need to consider the eventuality that it could take a while to fix and you need to plan for the recovery team to work in shifts.
One mistake — which had zero financial impact — was an effective test of two policies and the permission allocation regime, and resulted in the improvement of two procedures. And that doesn’t sound half bad.
Clear the fear, ditch the shame
Mistakes made through negligence, laziness or ignorance are generally a bad thing. But mistakes made in good faith are usually non-disastrous, and can often have a tangible net positive value. We must therefore learn to tolerate failure and to make clear to our people that whilst striding forward we will occasionally take the odd backward step. And we need to be ready to learn from those backward steps.
As Matthew Syed puts it in his book Black Box Thinking: “[When] we are fearful of being wrong, when the desire to protect the status quo is particularly strong, mistakes can persist in plain sight almost indefinitely.”
And as Sir Ken Robinson once said in a TED talk: “What we do know is, if you’re not prepared to be wrong, you’ll never come up with anything original.” We simply can’t innovate if we are terrified of something not working out as we hope.
A final point about mistakes: most things we do turn out to be sub-optimal — and were that not the case we would not have the concept of continual improvement.
In most cases we’re not making mistakes as such, but what we think is good turns out to be less good than it can be. We’re finding gaps in our procedures, realising that something key has been missed from a test plan, discovering that smokers are using the fire escape to nip out for a cigarette as it’s more “convenient” than going via reception.
As clause 10.2 of the ISO 27001 standard puts it: “The organization shall continually improve the suitability, adequacy and effectiveness of the information security management system”. Some of this improvement will be to address things that happened that we weren’t bargaining for.
So we should not be ashamed of things not going right. The only shame from getting something wrong in good faith should be the failure to learn from it, to improve, to change the way we work, to stand in front of peers, colleagues and others and say: “We did this, it didn’t work out like we hoped, here’s what happened, here are the lessons we learned. I hope this helps you avoid making the same mistake.”
Businesses cannot stand still, then. We need to innovate, to move forward, to change things, to do things we may not have done before, and that may well result in mistakes. But not only is that not a bad thing, in the long run it is eminently desirable.
And anyway, if we do something wrong, we can call it an “orange cake” rather than a mistake. It even rhymes. ®