Building disaster into the network: how UK.gov does IT
Contracting out the blame...
Analysis The Department for Work & Pensions may ultimately derive some lessons from its IT disaster last week, but it's doubtful that it will spot the most important one, far less take the necessary corrective action. The problem, you see, is that its IT strategy is now largely out of its control, and that it is essentially deskilling itself as far as IT strategy is concerned, and abdicating responsibility for the consequences. This has particular relevance across the Government, because on the one hand the overall budget depends on IT working sufficiently to achieve savings and staff reductions, while on the other IT is being outsourced to the private sector as a matter of policy.
The strategy depends on the heavy investment in IT paying off, and that success lies in the hands of outside companies. Yet last week's IT disaster could not have happened, and should not have happened, if the standard disciplines and safeguards of network management had been in place. The inquisition will presumably establish whose hand pressed the self-destruct button, but that is not really what's important - the button should not have been available to be pressed, and the DWP should have been in a position to know and understand why the button could not have been pressed. IT professionals looking at what happened must however surely conclude that the DWP effectively did not have anything in place that could be dignified by the term network management, and that if it even knew what the expression means it must have placed its faith entirely in the contractor, EDS.
With any justification? Well... The Register has been leaked sufficient information for us to have built up a reasonable picture of the disaster's genesis, and it's very simple. A small group of machines at the DWP was running a pilot XP installation in preparation for rolling out XP across the whole organisation. An update that was intended to be deployed only to this group of machines was, because of a bad or corrupt Group Policy, deployed to the entire network as well. According to one DWP source, this changed local policy in such a way that clients were denied network access. It wasn't immediately clear to the centre what had happened, and even when it was, fixing the issue without a working network was complex.
The DWP has somewhere between 80,000 and 100,000 PCs, and admitted to 80 per cent having been hit by the problem, which extended from Monday into Friday afternoon, when the DWP said things were getting back to normal. The DWP argues that the matter has been much overblown, and that there was little disruption to benefits payments; in which case, one is tempted to argue, why do you need getting on for 100,000 XP machines when you can manage so well without them? But although it is difficult to see how EDS, the main IT service provider for the DWP with, according to one staffer, "full access and control for all of our key systems and complete monopoly where decisions are made", could not have been in control of the finger on the button, we really do need to take a closer look at that button, before it can switch off the whole country.
It is, as Lib Dem MP Richard Allan commented in his blog on Saturday, no more than good practice to roll out updates to networks in small batches in order to minimise the damage that could be done if something went wrong. This was not of course an intentional rollout of an update to the entire network, but that simply makes it worse because, if a Group Policy action was not intended to apply to the whole network, then what on earth was it doing being pointed at it? And how is it possible to do something like that by accident? (Or, come to that, at all?)
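For readers who want the batching principle spelled out, here is a minimal sketch in Python. It is purely illustrative and says nothing about how the DWP's or EDS's tooling actually works: `staged_rollout`, `apply_update`, `batch_size` and `max_failures` are all hypothetical names of our own. The idea is simply that an update goes to one small batch at a time, and the rollout halts as soon as a batch reports more failures than you are prepared to tolerate, so later batches are never touched.

```python
from typing import Callable, List


def staged_rollout(
    machines: List[str],
    apply_update: Callable[[str], bool],
    batch_size: int = 50,
    max_failures: int = 0,
) -> List[str]:
    """Apply an update one batch at a time (hypothetical helper).

    apply_update is an assumed callback that returns True when a machine
    updated successfully. The rollout stops after the first batch whose
    failure count exceeds max_failures.

    Returns the machines that were updated successfully.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    updated: List[str] = []
    for start in range(0, len(machines), batch_size):
        batch = machines[start:start + batch_size]
        failures = 0
        for machine in batch:
            if apply_update(machine):
                updated.append(machine)
            else:
                failures += 1
        if failures > max_failures:
            # Halt here: machines in later batches are left alone.
            break
    return updated
```

With a batch size of 50 and a zero failure tolerance, a bad update damages at most one batch of 50 machines before the rollout stops, rather than 80 per cent of the estate.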
Good IT practice surely dictates that the pilot XP system should have been thoroughly walled off from the operational network, and that the accounts and policies used to manage it should have been quite distinct from those of the main network. As this manifestly could not have been the case, we are left with virtually the sole conclusion that the DWP's network policies are badly broken, and quite possibly being operated by the class of IT administrator who thinks he's god.
Which is OK, but wise IT admins who think they're god make sure they've got a few failsafe policies in place so that they can't destroy the whole network, company, department or Government if they accidentally push the wrong button. One of the ways you can do this is to construct rational and limited-sized groups for different roles and purposes, and another is for you to not be driving the network while you have vastly more admin privileges than you need for the particular jobs you're doing. Is this basic network management training? Yes, we fear it is.
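Both failsafes can be expressed as a check that runs before anything is deployed. The Python below is a sketch under our own assumptions, not a description of any real product or of the DWP's setup: `Operator`, `check_deployment`, `DeploymentRefused` and the five per cent ceiling are invented for illustration. The check refuses a deployment if the operator has no rights over the target group, or if the target list covers more of the network than any single action should.

```python
from dataclasses import dataclass
from typing import FrozenSet, Sequence


class DeploymentRefused(Exception):
    """Raised when a deployment fails a pre-flight check (hypothetical)."""


@dataclass(frozen=True)
class Operator:
    # Hypothetical model of an admin account holding only the
    # group rights it needs for the job in hand.
    name: str
    allowed_groups: FrozenSet[str]


def check_deployment(
    operator: Operator,
    group: str,
    targets: Sequence[str],
    network_size: int,
    max_fraction: float = 0.05,  # assumed ceiling, chosen for illustration
) -> None:
    """Raise DeploymentRefused unless the deployment is narrowly scoped."""
    if network_size <= 0:
        raise ValueError("network_size must be positive")
    # Least privilege: the operator must hold rights over this group.
    if group not in operator.allowed_groups:
        raise DeploymentRefused(
            f"{operator.name} has no rights over group {group!r}"
        )
    # Limited-size groups: no single action may touch most of the network.
    if len(targets) > max_fraction * network_size:
        raise DeploymentRefused(
            f"{len(targets)} targets exceeds {max_fraction:.0%} "
            f"of {network_size} machines"
        )
```

An operator whose account covers only the pilot group could then push to a couple of hundred pilot machines, but a target list of 80,000 would be rejected before a single client was touched, whoever's finger was on the button.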
Meanwhile, the DWP is probably deriving precisely the wrong lesson from the fact that its operations were, at least after a fashion, able to continue during the crisis. Having the PCs disabled meant that new cases and amended cases couldn't be processed, but existing payments could be made, so an outage for a couple of days needn't have a long term impact. Or so they think. The DWP is not however where it intends to be with its PC network yet. It currently has under construction something it terms its Desktop Office Infrastructure (DOI). According to in-house techies this is proving a major headache to implement, but in principle it's intended to be The Way Forward. The use of networked PCs is intended to produce efficiencies and savings by allowing the DWP to cut staff - but if this goes ahead and it's still possible for a single finger to destroy the PC network, what kind of damage would the DWP's operations then sustain? As you push more of your critical assets out onto the things most likely to break, you should be taking a great deal of care to construct systems to stop them breaking, and to build resilience into the network.
The point to bear in mind here is that although we've been hearing about the joys of sharing information on PC networks and the wonders of single console network administration since the ark, we tend to forget that it's something that's only just starting to arrive on a large scale, particularly in government. Sure, for years it's been possible for imbeciles in charge of network configurations that are too dumb to live to break little bits, but it's only just becoming possible for them to paralyse whole departments, across the country. The UK Government actually has quite a lot of these networks going into place right now, but is smitten by the brochureware on the joys of centralised network management. It does not grasp (this isn't in the brochures) that if you don't do the management properly then you're building a self-destruct capability in, and for so long as it concentrates on haggling over prices and delivery without having the capability to understand the nuts and bolts, it won't be able to grasp how dangerous the weapons it's building are.
By now the DWP and the rest of the Government's departments are beginning to categorise last week's crisis as just one of those things, and an isolated problem that has been overcome. That's if they're still thinking about it at all. But at the very least they should be asking, could it happen again, could it happen to me? Under what circumstances if any, ministers should be asking, would it be possible for any single person in the organisation to roll out an executable, a patch, a change in settings, anything, to the whole network at once? And if that's not theoretically possible, what safeguards are in place to ensure that it really isn't possible, and that it can't become possible either through error or deliberate sabotage? Most of them won't know the answer, nor will they readily be able to find out what it is, and if told it's all perfectly secure by their consultants or contractors, they'll happily believe that.
Ignorance oughtn't to be an excuse, but these days, regrettably, it often is. Outsourcing means you always have somebody to blame, and so what if you're so ignorant of the technology and the issues that you're utterly incapable of specifying a project properly (and thus, utterly incapable of having an IT project work)? Never mind, you can always blame the contractor when it goes wrong - but it's your network, your department, and sooner or later that flock of chickens is going to come home to roost. ®