On Call Welcome to Friday. The weekend is almost upon us so put down that bacon sarnie and pick up today's On Call, The Register's weekly column of tales from the tipping point.
Today's story comes from "Jason", who spent some quality time working as a local sysadmin "at one of the larger bases in Iraq" towards the end of the last decade.
Jason was naturally on call pretty much the whole time, being a contractor at the base, and was present when the base's Tech Control Facility (TCF) was undergoing a monthly power system transfer test.
For a bit of context, the TCF was responsible for the base's network, communications, Non-classified Internet Protocol Router Network (NIPR) and the super-hush Secret Internet Protocol Router Network (SIPR). NIPR was used for unclassified information sharing and access, while SIPR was the Department of Defense's version where more classified data could be flung.
"Our setup had fully redundant line connections going to two separate base generator farms (what everything ran off) as well as our very own diesel generator rated to run everything needed."
There was also battery backup to plug any gaps – a fact that will shortly become very significant.
The tests had been performed many times without incident, and the electrical contractors would normally isolate the TCF from base power and fire up the generator for a half-hour test run.
This particular test started around lunchtime so Jason pottered off to the base's Subway "to get my turkey bacon sub (hold the bacon)".
When he returned, things were not going well. "The generator wasn't running as it should have been and the electricians were gathered around the switching controls for the power along with some of the tech control folks."
The gang were also unable to switch back to the base's power and so things were running off batteries.
Now, about those batteries. "I would love to say I had supreme confidence in our battery setup but in reality it was just a covered storage area outside the TCF where the rows of car batteries could bake in the 120+° [Fahrenheit / ~49°C] heat during the summers. Since I had been there we had never fully tested the capability of it."
Well, Jason, today's the day!
The unknown quantity of the batteries aside, Jason's other issue was that the room was getting warmer, since the limited power was directed solely at keeping the servers running. He told us: "The servers were racked in a very improper way (before I arrived of course), so to open one and clean the dust off the parts might require moving three other servers out of the way in the rack. As such, [there was] lots of dust helping to insulate things inside the servers."
The room normally ran at between 68°F and 70°F (~20°C and 21°C), and Jason's first check showed the temperature had crept up to 80°F (~27°C). However, the NIPR, SIPR and comms equipment were all still running.
"Twenty minutes pass and I'm feeling a twinge of pride that our cobbled-together battery backup system has held up to the task. I also realise it's nearing a time where we will need to power down systems due to the temp now rising north of 90°F [~32°C]."
Jason prepared to begin a controlled power down – where you'd shut down non-critical servers, but keep the comms equipment running so the phones at least would keep working – when the inevitable happened. The batteries abruptly died, taking everything down with them.
"With no power to the TCF, all we were left with was backup Defense Switched Network (DSN) phones (a basic POTS system for the base that few buildings were set up to use)."
The gang eventually restored electricity after a battle with the power line switch, but were now faced with a new problem that will have many admins stroking their chins in recognition.
All of the domain controllers had been virtualized. The problem with that, as Jason explains, was: "We could power on the physical hardware for the virtual server but us lowly local admins did not have rights on the vSphere software to start the virtual domain controllers."
Thus there was little point in firing up the email or file servers.
"I had argued for keeping at least one physical domain controller but to no avail."
And, of course, with the NIPR and SIPR servers being a little unhappy, VoIP wasn't an option and, naturally, nobody had the DSN number for contacting the support staff who could actually start the virtual servers.
All told, it took 25 minutes of calling around to find the right people while a large chunk of the TCF remained down.
The joy of government standard operating procedures, eh?
Jason finished by telling us: "I did enjoy writing up our after-action report on the major outage and getting to explain why it was extended an extra 40 minutes because we didn't have the simple rights to turn on a virtual server..."
Sometimes the words "I told you so" can be the most satisfying of all.
And that non-functional generator and balky switch? Still a mystery.
Ever found yourself at the sharp, pointy end of a routine test gone wrong or uttered the words "I warned you this would happen"? Of course you have, and you should tell On Call all about it. ®