
Wait... who broke that? Things you need to do to make your world diagnosable

Crisis management: Audit, access – the whole shebang

You only ever discover the inadequacy of your system management, monitoring and diagnosis tools when something goes wrong and there's a gulf between what you need to do and what your tools will let you do. Here are 10 things you can do to maximise your chances of diagnosing the problem when the brown stuff hits the ventilator.

Ladies and gentlemen, synchronize your watches

The most important thing you can possibly do is have an authoritative time source and ensure all your infrastructure components set their clocks from it. If systems have different times on their internal clocks, whatever logs you have available will be almost impossible to collate either manually or using security information and event management (SIEM) tools. Note that I'm not saying that the clocks actually have to be right to the nearest nanosecond relative to GMT – just that they need to be identical to each other. But use an authoritative time source anyway, as having everything set to the right time is also useful. And just to reflect on that mention of GMT: have everything set to the same time zone regardless of where it is – again, it'll avoid you having to do mental addition and subtraction when the time comes.
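To show what that mental arithmetic looks like when it has to be done after the fact, here is a minimal Python sketch that collates two logs stamped in different time zones into a single UTC timeline. The host names, messages and the ISO 8601 timestamp format are all assumptions for illustration; real logs will need their own parsing.

```python
from datetime import datetime, timezone

# Hypothetical log entries from two hosts that stamp events in different zones.
# ISO 8601 timestamps with explicit offsets are an assumption of this sketch.
web_log = [
    ("2024-03-01T09:15:02+00:00", "web01", "upstream timeout"),
    ("2024-03-01T09:15:09+00:00", "web01", "503 returned to client"),
]
db_log = [
    ("2024-03-01T04:15:00-05:00", "db01", "connection pool exhausted"),
]

def to_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with an offset and convert it to UTC."""
    return datetime.fromisoformat(stamp).astimezone(timezone.utc)

def collate(*logs):
    """Merge several logs into one timeline ordered by UTC time."""
    events = [(to_utc(ts), host, msg) for log in logs for ts, host, msg in log]
    return sorted(events, key=lambda event: event[0])

timeline = collate(web_log, db_log)
for when, host, msg in timeline:
    print(when.isoformat(), host, msg)
```

Once converted, the database event turns out to precede both web errors by a couple of seconds, which is far from obvious when the raw stamps read 04:15 and 09:15. None of this helps if the clocks themselves disagree, which is why the common time source comes first.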

Audit logging

You need to be able to see who took what action, and when they did it. It's crucial, because who's to say they won't do it again? If the employees start to moan that Big Brother is watching them, tell them, as politely as you see fit, why you are doing this.

I have come across relatively few problems in my time that were malicious and deliberate, but loads of instances where users did something wrong unwittingly – in which case I've been able to instruct or train them and/or clarify documentation. Also, remember that it's often not a person who was the last to touch something before it broke: it's often a script, and the audit log will give you a pointer that helps you find which one.
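As a rough illustration of what a useful audit entry might carry, the Python sketch below records who acted, through which script, on what, and when. The field names, the example user and the script name are hypothetical; most platforms have their own audit facilities, and this is only meant to show the shape of the data.

```python
import getpass
import json
import sys
from datetime import datetime, timezone

def audit_record(action: str, target: str, actor: str = "", via: str = "") -> str:
    """Build one JSON audit line: who did what, to what, when, and via which script."""
    record = {
        "when": datetime.now(timezone.utc).isoformat(),  # UTC, to match other logs
        "who": actor or getpass.getuser(),               # default: the logged-in user
        "via": via or sys.argv[0],                       # the script that made the change
        "action": action,
        "target": target,
    }
    return json.dumps(record, sort_keys=True)

# Hypothetical example: a scheduled script restarts a service on a user's behalf.
line = audit_record("restart", "payments-api", actor="jsmith", via="nightly_cleanup.py")
print(line)
```

Recording the "via" field separately from the "who" field is what lets you tell a person's mistake from a script's.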

Log levels

Logs take shedloads of disk space, and they are a paradox: when all is well you want to turn them down to report only emergency problems, but when something breaks you want debug-level logs from ten minutes ago. Think hard about every log and set its level appropriately; use log rotation judiciously; and set aside plenty of disk space (it generally doesn't have to be super-fast disk) dedicated to logging so you don't kill your world by filling core volumes with logs.
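One way to soften that paradox, sketched here in Python, is to hold recent debug records in a fixed-size memory buffer and write them out only when an error turns up. This is an illustration under assumptions rather than a recommendation for any particular product: the class names, the logger name and the tiny buffer size are all made up for the example.

```python
import collections
import logging

class RecentDebugHandler(logging.Handler):
    """Keep the last `capacity` records in memory; write them out when trouble arrives."""

    def __init__(self, target: logging.Handler, capacity: int = 1000,
                 trigger: int = logging.ERROR):
        super().__init__(level=logging.DEBUG)
        self.target = target
        self.trigger = trigger
        self.recent = collections.deque(maxlen=capacity)  # oldest records fall off

    def emit(self, record: logging.LogRecord) -> None:
        self.recent.append(record)
        if record.levelno >= self.trigger:
            # Replay the buffered context, ending with the triggering record.
            for buffered in self.recent:
                self.target.handle(buffered)
            self.recent.clear()

captured = []  # stands in for a file on the dedicated logging volume

class ListHandler(logging.Handler):
    """Hypothetical target that collects formatted lines in a list."""
    def emit(self, record: logging.LogRecord) -> None:
        captured.append(f"{record.levelname}:{record.getMessage()}")

log = logging.getLogger("demo.app")  # hypothetical logger name
log.setLevel(logging.DEBUG)
log.propagate = False
log.addHandler(RecentDebugHandler(ListHandler(), capacity=3))

for i in range(5):
    log.debug("step %d ok", i)       # buffered; never written on a quiet day
log.error("disk controller fault")   # flushes the most recent records

print(captured)
```

On a quiet day nothing reaches the disk at all; when the error arrives, the last few debug lines come with it. Python's standard library has a similar `logging.handlers.MemoryHandler`, but it also flushes whenever its buffer fills, so it does not discard old records the way this sketch does.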

Out-of-band management

If you're using an enterprise-level server and you don't buy the out-of-band management module, you're a numpty. In my past life looking after a global enterprise's network, my servers-and-storage peer and I would always ensure we had full out-of-band console control for all our kit, even to the extent of a dial-up modem connection into a terminal server and KVM unit to cater for WAN and VPN failures. It cost us probably a couple of thousand quid per site and it saved our bacon countless times.

Access levels

I've seen it so often: the out-of-hours guy gets called, can't fix the problem, escalates it to the second-line guy … and his privileges don't let him in with adequate permissions either to look at the problem or to fix it. Use role-based permissions and make sure the support teams have the right profiles – and test them frequently.
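A regular test of those profiles can be as simple as comparing what each support tier has been granted against what it needs. The Python sketch below assumes a hypothetical three-tier rota and made-up permission names; in practice the granted set would be exported from your directory or access-management system.

```python
# Hypothetical roles and permissions; real names would come from your own systems.
ROLE_PERMISSIONS = {
    "first_line":  {"view_logs", "restart_service"},
    "second_line": {"view_logs", "restart_service", "edit_config"},
    "third_line":  {"view_logs", "restart_service", "edit_config", "root_shell"},
}

# What each tier must be able to do for an escalation to be worth making.
REQUIRED = {
    "first_line":  {"view_logs"},
    "second_line": {"view_logs", "edit_config"},
    "third_line":  {"view_logs", "edit_config", "root_shell"},
}

def missing_permissions(granted, required):
    """Return {role: permissions the role needs but lacks}; empty when all is well."""
    gaps = {}
    for role, needed in required.items():
        lacking = needed - granted.get(role, set())
        if lacking:
            gaps[role] = lacking
    return gaps

print(missing_permissions(ROLE_PERMISSIONS, REQUIRED))
```

Run on a schedule, a check like this flags the second-line engineer's missing permission on a Tuesday afternoon rather than at 3am during an outage.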

Current credentials

Which brings us onto currency of credentials. If you're the tenth on the call-out list you've probably not been called for months: then when the phone does ring you find your account on the system you're trying to fix has expired. Sometimes you'll be lucky and the system will say: “Your account has expired: click here to change your password”; sometimes you won't and it'll say: “Your account has expired: contact your system administrator”. Again, check your credentials regularly so you know they'll work when you need them.
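The regular check can be automated if you can export account expiry dates. Here is a minimal Python sketch, assuming a hypothetical list of (person, system, expiry date) entries and a 14-day warning window; the names and dates are invented for the example.

```python
from datetime import date, timedelta

# Hypothetical export of the call-out list: (person, system, account expiry date).
ACCOUNTS = [
    ("alice", "core-switch", date(2024, 6, 30)),
    ("bob",   "core-switch", date(2024, 3, 10)),
    ("carol", "san-array",   date(2024, 2, 1)),
]

def expiring(accounts, today, warn_days=14):
    """Return accounts that have expired or will expire within `warn_days`."""
    cutoff = today + timedelta(days=warn_days)
    return [(who, system, expiry) for who, system, expiry in accounts
            if expiry <= cutoff]

for who, system, expiry in expiring(ACCOUNTS, today=date(2024, 3, 1)):
    print(f"{who}: account on {system} expires {expiry.isoformat()}")
```

With the example date, bob's account is flagged as about to expire and carol's as already expired, while alice's is left alone.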

Access to top-secret passwords

It's common to have the God-level password for core systems unknown to anyone. Sounds a bit daft, I know, but a good way to stay secure is: (a) write a ridiculously complex password on a piece of paper; (b) set the top-level password to that string; and (c) seal the paper in an envelope and lock it in a safe. If the password is sufficiently complex and odd, the person who wrote it won't be able to remember it, so nobody carries it around in their head, yet you always have a last-ditch means of access. But make sure you have the right processes in place for rousing those who have access to the safe when you need the password.
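If inventing a ridiculously complex string by hand doesn't appeal, a few lines of Python will produce one. This sketch uses the standard `secrets` module; the length of 32 and the decision to drop look-alike characters (because the password will be read off paper) are assumptions of the example, not requirements.

```python
import secrets
import string

# Characters that are easy to misread on paper; excluding them is an assumption
# of this sketch, not a rule.
AMBIGUOUS = "Il1O0|`'\""
ALPHABET = "".join(
    c for c in string.ascii_letters + string.digits + string.punctuation
    if c not in AMBIGUOUS
)

def break_glass_password(length: int = 32) -> str:
    """Generate a random password using a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(break_glass_password())
```

Check that the target system accepts the length and the punctuation before sealing the envelope.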

Support team contacts

You don't know everything about everything, so keep your contact list and rota up to date and accessible. Make sure the company's starters and leavers process is integrated with this list (I wish I had a fiver for every call-out list I've seen that contained people who had left and/or died). Keep the master on a shared server, and have an automated sync on your laptop that keeps your local version up to date. Why the latter? Because you're stuffed if you need to call someone and the master list is on the failed system. (And yes, I've been bitten by that one.)
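Tying the list to the starters and leavers process can start with a simple cross-check. The Python sketch below assumes a hypothetical set of current staff (as the HR process might export it) and a call-out list held as plain records; every name and extension is invented.

```python
# Hypothetical data: current staff from the starters/leavers process,
# and the call-out list as it stands today.
CURRENT_STAFF = {"alice", "bob", "dave"}

CALL_OUT_LIST = [
    {"name": "alice", "role": "network", "phone": "x1001"},
    {"name": "carol", "role": "storage", "phone": "x1002"},
    {"name": "dave",  "role": "servers", "phone": "x1003"},
]

def stale_entries(call_out, staff):
    """Return call-out entries for people who are no longer on the staff list."""
    return [entry for entry in call_out if entry["name"] not in staff]

for entry in stale_entries(CALL_OUT_LIST, CURRENT_STAFF):
    print(f"Remove or replace: {entry['name']} ({entry['role']})")
```

Here carol is flagged because she no longer appears in the staff list, which leaves storage with nobody to call until the rota is fixed.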

Business Crisis Management

What does BCM have to do with diagnosing faults? In a word: time. If something has gone really pear-shaped, you need to be able to focus on figuring out what's wrong and getting it fixed. With a well-structured BCM regime you can invoke the necessary BCM call-outs and let them get on with all the ancillary stuff like informing customers/press/suppliers and helping supply you with more brains and hands to help out; you can therefore get on with the bit that most needs your skills – diagnosing the issue.

Lessons learned

When you've figured out the problem and fixed the underlying cause, always set aside time to consider (and preferably discuss with others) how and why the problem happened. Reflect on what you did, and what (if anything) could have been done better. Apply these considerations to your documentation, processes, procedures and training. After all, you can spend your life thinking “what if X went wrong” and catering for those ideas in your processes, but you don't get a better “what if” than a real-life failure. ®
