Monitoring is simple enough – green means everything's fine. But getting to that point can be a whole other ball game
Don’t take no for an answer, but be prepared to give it.
Monitoring seems easy in principle. There is nothing particularly complex about the software or the protocols it uses to interrogate systems and deliver alerts, nor is there anything difficult about deciding what to monitor or setting up your chosen product.
Yet although it's common to find monitoring done pretty well, it's very rare to find it being done brilliantly.
The basic thing you want from your monitoring is to show that, when everything is working within tolerance, all indicators are green. This sounds obvious, but it's far from simple to achieve.
Let's look at a company that this correspondent worked with a few years ago, which had a truly impressive monitoring regime.
The network manager wanted to carry out some maintenance overnight, which would cause an outage on the global WAN links into the main data centre of a 1200-person company with a dozen sites in four continents and a core setup with north of 1000 virtual servers. He emailed the overnight monitoring team to warn that some alarms would happen, powered down a handful of key systems, spent 90 minutes making the required changes, powered everything up again, watched until everything on the management console went green, emailed the overnight team to say "all done" and went home.
If you can go home at 1am, confident that the green indicators mean you don't have to test anything else, and the Delhi office starts work in an hour or two when you're just getting to sleep, it's fair to say that the monitoring has been done well.
So how do you do it?
First, you need to recognise that, while change is essential in any company, it is the enemy of the monitoring regime. Ninety-nine-point-nine per cent coverage in your monitoring is simply not good enough — it is absolutely essential that any new device or system connected to the infrastructure is added to the monitoring regime. Miss one system and your monitoring infrastructure is immediately invalid. So before you even contemplate putting significant effort into getting everything onto the platform, develop and test your change policy and process for adding new systems and removing old ones. If you don't have the process absolutely nailed, ready for when everything is monitored, you'll never be able to maintain 100 per cent coverage.
One of the easiest things to underestimate is the effort needed to get everything into the monitoring platform. In our example it took one person, dedicated to the task, a little over a year — and he was most definitely not lazy. Over 1000 servers, a dozen sites' worth of network kit, internet routers, firewalls, Wi-Fi access points, phone systems — the numbers add up. And if you don't design the monitoring properly before you start configuring it — protocols, SNMP credentials, alerting levels, alerting mechanisms, and so on — you'll simply end up having to retro-fit things you forgot to dozens or even hundreds of devices.
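The up-front design work described above — settling on protocols, credentials, polling and alerting before touching hundreds of devices — can be captured as a per-device profile decided once and applied everywhere. A minimal sketch, with entirely illustrative field names and values (no real monitoring product's schema is implied):

```python
from dataclasses import dataclass, field

@dataclass
class MonitoringProfile:
    """One device's monitoring design, agreed before configuration starts.

    All field names and defaults here are illustrative assumptions,
    not any particular platform's schema.
    """
    protocol: str                    # e.g. "snmp-v3", "wmi", "ssh"
    credential_ref: str              # reference into a credential store,
                                     # never the secret itself
    poll_interval_s: int = 300       # how often to interrogate the device
    alert_channels: list = field(default_factory=lambda: ["email"])
    thresholds: dict = field(default_factory=lambda: {
        "disk_used_pct": 90,         # alert when disk crosses this level
        "cpu_pct": 95,
    })
```

Deciding these defaults once means a forgotten alerting level is a one-line change to the profile, not a retro-fit across dozens or hundreds of devices.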
As you add systems to the monitoring regime, you need more states than just "on" and "off" — that is, you can't just go from something not being monitored to it being monitored, because the inevitable result is that something will turn red at some point when you're midway through configuring the item you just added. New items need to wink into a state in which they exist and are visible (if they're not there, you can't configure them) but don't contribute to status indicators or alerts. Once each item is completely configured and appears to be operating correctly, you can flip it to "live" mode.
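That three-state lifecycle — not there, visible-but-silent, live — can be sketched as a tiny state machine. The state names and class are hypothetical, just to make the rule concrete: an item must pass through the staged state, and only a live item contributes to alerts:

```python
from enum import Enum

class MonitorState(Enum):
    """Lifecycle states for a monitored item (names are illustrative)."""
    UNMANAGED = "unmanaged"  # known to exist, not yet in the platform
    STAGED = "staged"        # visible and configurable, but excluded
                             # from status roll-ups and alerting
    LIVE = "live"            # fully monitored; contributes to alerts

class MonitoredItem:
    def __init__(self, name: str):
        self.name = name
        self.state = MonitorState.UNMANAGED

    def stage(self):
        # New items wink into existence without turning anything red
        self.state = MonitorState.STAGED

    def go_live(self):
        # Only promote once configuration is complete and the checks
        # have been observed behaving correctly
        if self.state is not MonitorState.STAGED:
            raise ValueError("item must be staged before going live")
        self.state = MonitorState.LIVE

    def contributes_to_alerts(self) -> bool:
        return self.state is MonitorState.LIVE
```

The point of the guard in `go_live` is that nothing can jump straight from unmonitored to alerting: the half-configured state can never paint the console red.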
The next reason that this particular company made such a good job of its monitoring might be considered controversial by some: a lack of democracy. It is hard to imagine anyone with any IT experience who hasn't come across several instances of people or departments wanting an exception: Legal say the data on their document management system contains material that is too confidential to risk letting the monitoring agent interrogate it; the developers don't want their kit monitored, for vague and undisclosed reasons of "sensitivity" and "confidentiality"; the app support team insist that <insert application name here> doesn't need us to monitor it because the company that supports it already monitors it; the R&D team don't want to give the IT ops people access to their proprietary systems to hook in the monitoring calls.
The answer has to be a simple "no". The story your monitoring console gives you must be unequivocal, and the moment you succumb to someone begging for an exception is the moment it all goes south.
The favourite companies for which this correspondent has worked could all be placed in the category of "benevolent dictatorship". The senior management (often the owner) would have an idea, ask his or her trusted lieutenants and techies for suggestions, usually tweak the idea based on what people have said, then tell you: "Right, get on with it and tell me if anyone tries to get in the way." If you're trying to do monitoring properly and you're told you need to write papers and persuade people to give their buy-in, try to think of something else to do that is less painful than banging your head on a wall.
Another reason that this particular company succeeded was that despite having an extensive infrastructure, it had a very talented, centralised infrastructure team. While there were IT staff — sometimes several of them — in each of the worldwide offices, the network, server, storage and monitoring systems were all run centrally. It was truly a joy to work in a way where one could get on with it without constantly having to co-ordinate remote teams to configure credentials on servers or routers, or to chase them up when other jobs took priority. Distributed support and distributed teams don't preclude success, of course — you simply need to accept that there will be a much greater co-ordination effort in that scenario.
How long is a piece of string?
Probably the second most tricky part of getting your monitoring right — behind achieving the "golden 100 per cent" — is deciding how deep you need to go with monitoring. The default starting point is system availability and performance: up/down status, disk capacity, memory usage, network link utilisation, LAN port errors. The precise extent to which you monitor beyond that is, however, very much a "How long is a piece of string?" question, and the right answer to "How far do we go?" is unique in every case.
In reality, you go up the layers as far as you need to in order to be happy that "everything green" equals "all is well". So, for your main web server, for example, you won't stop at "Can I ping it and is the LAN interface passing traffic?" You'll have synthetic transactions to test the web layer and the database behind it, preferably with remote agents out on the internet somewhere testing that all layers are working correctly from afar and not just internally. Of course, doing extensive tests of this form can cause the console to take many minutes to finally go all-green, but this is far preferable to having to do manual tests, or to skimping on monitoring in favour of time. And in the majority of instances the console will be a steady progression of red blobs becoming green one by one, rather than waiting 15 minutes with nothing happening and then (hopefully) BAM! — a sea of success.
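A synthetic transaction of the kind described above boils down to: do what a real user would do, and check for content that can only appear when every layer behind the page is healthy. A minimal sketch — the URL, expected text, and `fetch` hook are all assumptions for illustration (the hook exists so the check can be exercised without a live system):

```python
import time
from urllib.request import urlopen

def synthetic_check(url: str, expect_substring: str,
                    timeout_s: float = 10.0, fetch=None) -> dict:
    """End-to-end synthetic transaction: fetch a page the way a user
    would, and confirm the response contains content that only renders
    when the web tier AND the database behind it are both working.

    `fetch` is injectable for testing; by default it does a real HTTP GET.
    """
    fetch = fetch or (lambda u: urlopen(u, timeout=timeout_s).read().decode())
    start = time.monotonic()
    try:
        body = fetch(url)
        ok = expect_substring in body
    except Exception:
        # Timeouts, connection errors, HTTP errors all count as failure
        ok = False
    return {"ok": ok, "latency_s": time.monotonic() - start}
```

Run the same check from an agent inside the LAN and from one out on the internet, and "everything green" starts to mean what it should: the whole stack works from where the users are.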
There is one final thing you need to do in order to rely on your monitoring, and it's something that one so often sees being overlooked or at least done poorly: testing. Most of us just don't test our monitoring properly.
Why? Simple: testing it properly means breaking things. If you want to prove conclusively that the complex monitoring of your CRM system is configured correctly, the only absolutely certain way to do so is to cause the system to fail in as many ways as you can think of. Yes, one can implement artificial ways of emulating a failure, but there can still be that nagging doubt as to whether they really are absolutely accurate and whether a real failure will look the same to the monitoring agent.
If it's not realistic to ask for extra downtime on core systems, though, you can be a little canny about it and work with the owners of those systems to persuade them to let you do your monitoring testing the next time they have a period of planned downtime. It's an excellent second-best approach because you get to try scenarios for real but nobody has to beg the business for more downtime than they want — or impose upon the consumers if it's a customer-facing system.
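The planned-downtime approach above is easy to run as a simple loop: for each failure mode you can think of, break the thing for real and confirm the platform actually raised the matching alert. A minimal sketch, where `induce` and `alert_fired` are hypothetical hooks into your environment and monitoring platform:

```python
def verify_alerting(failure_modes, induce, alert_fired):
    """During a planned-downtime window, induce each failure for real
    and record whether the monitoring platform raised the expected alert.

    `induce(mode)` breaks the system in the named way (hypothetical hook);
    `alert_fired(mode)` polls the platform for the matching alert.
    Returns a dict of mode -> bool so gaps in coverage are obvious.
    """
    results = {}
    for mode in failure_modes:
        induce(mode)
        results[mode] = alert_fired(mode)
    return results
```

Anything that comes back `False` is a failure your console would have slept through — exactly the nagging doubt an emulated failure can never quite remove.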
Monitoring, then, needs rigour and determination, completeness and testing. But when you get it how it needs to be, and when you see it stay how it needs to be, it'll change your attitude completely. ®