We have redundancy, we have batteries, what could possibly go wrong?
Practice makes... less than perfect
On Call A Register reader finds the inevitable single point of failure after a call-out to the heart of darkness in this week's On Call.
Our story, from a reader we will call "Philip", takes us back to the 1990s. It was a heady time of hulking on-premises servers, demanding customers, and an in-house call center. "What could possibly go wrong?" he asked.
"Well, everything as it happens."
Not that Philip's employer had failed to take every possible step to ensure business would continue no matter what befell the outside world. Racks of eye-wateringly expensive 9GB drives filled the data room. There were twinned VME servers, redundant power supplies, and comms kit. And, of course, there were banks of batteries to keep the juice flowing smoothly until a generator could kick in during a blackout.
"Everything," said Philip, "right down to the coffee machines and front door lock ran via this smoothed and protected power."
The only things that weren't battery-backed were the single elevator and "the desk-side uplighters that someone in HR had decided would cast a work-inducing orange glow over the staff."
It was a shining example of correct and proper site operation. Backups were performed and rotated, and once a week the site generator was fired up so the power feed could be checked and the "OK" box ticked on the checklist. "It always was," said Philip. "This was a model site, only very recently commissioned and everyone was very happy with it all."
Until the fateful day when Philip's pager went off. Work on a nearby road had resulted in a power cable being severed. All was well – the switchover had gone as planned. However, he said, everyone knew he'd figured out how to open stuck elevator doors and "we had a couple of people who'd decided the lift was a good place to hold 'a private discussion'," explained Philip, "so could I please work my magic on the door release?"
So to the office Philip drove. Twenty minutes later he was parking and could see the soft orange glow from the uplighters shining through the windows, meaning the generator was running.
"However, after locking the car and once again turning towards the building, I saw no orange glow. Actually, I saw no lights at all. What I could see was a growing number of faces pressed up against the windows."
His security pass also wasn't working. No problem, though – he had an old-fashioned key and let himself in to find the building completely dark and silent (aside from confused noises from staff and other sounds from whatever was happening in the elevator).
"Not only had the god-awful uplighters died, so had power to the entire building," he said. The generator had stopped.
"The moral of this story? If, perchance, the maintenance chap responsible for that site is a reader of El Reg, could I use this medium to remind him that if he's going to test run the generator once a week then you'll need to refill the bleedin' fuel tank occasionally."
Happy to oblige, and we're delighted to note that incompetence when it comes to keeping generator fuel tanks topped up appears to be a worldwide thing. However, while our Australian friends managed to keep their site running until fuel arrived, Philip's went down. And went down hard.
"I just wish I'd been a fly on the wall at the debrief," recalled Philip. "They spend a fortune making sure that site stayed up, everything had dual PSUs or warm spares and it all failed because some oik forgot to buy a few pounds worth of red diesel."
Single points of failure – there's always one where you least expect it. Tell us about your call-out into darkness with an email to On Call. ®