A practical demonstration of the difference between 'resilient' and 'redundant'
Bump and grind in the server racks
Who, Me? Monday is upon us, and with it comes a cautionary tale of how one Register reader's overconfidence led to his undoing, thanks to an unexpected interfacing with a belt buckle, in today's edition of Who, Me?
Our story comes from "Dan", a lead system admin at what he described as a "rather large company" where the vast majority of the business went through a single suite of applications backed by a central database.
"The number of zeros on the 'dollars per minute' lost in unscheduled downtime was frightening," he told us.
The company liked Dan. He had rocked up after his predecessor departed under a cloud and had spent quite a while knocking the environment into shape. He found Development, QA and Production all running with differing patch levels and occasionally even different OS versions. Configurations didn't match. Hardware architectures differed. And so on.
It took a while but, once Dan had lined everything up, downtime due to bugs being found in production dropped significantly. It's fair to say the company was very pleased. Perhaps a bit too pleased.
"There remained, however, one nasty fly in the ointment," he said. "The guys in sales have been promising our clients for years that our systems were resilient and redundant.
"Resilient, they had become. Redundant they weren't. Not in any sense of the word."
However, bit by bit, the company's systems were indeed slowly becoming redundant, aided by Dan "stalking the developer cubes with a bat to encourage the removal, or non-creation, of code that would not play nice with node failover."
The final piece of the redundancy jigsaw was the database. It ran on Sun hardware, and Dan's team could hotswap pretty much any hardware component without the system suffering any downtime. Resilient, for sure. But still not redundant despite the joy from management at the dramatic reduction in outages.
"They thought my team had learned to walk on water or something," recalled Dan, happily.
The next step was to do some hardware duplication and create a cluster capable of withstanding all manner of disaster. "The DBAs were practically salivating at the prospect," Dan recalled, but the numbers involved were large enough to invoke a "steady on, chaps" from management.
On the day in question, Dan received a routine trouble ticket. It looked like either an adapter card or Gigabit Interface Converter (GBIC) had died. No problem – he was already on site and there were plenty of spares in the data centre.
He asked a colleague to pop in a change request for a hotswap and headed into the computing sanctum to do the deed. He had just enough time to get it sorted before heading off for lunch.
It transpired it wasn't the GBIC at fault, but the adapter card.
Fine. He'd done this many times. It was a simple case of powering down the system, pulling it out on its rails, replacing the card and firing it back up. Simple.
There was a slight wrinkle in the process, but not an unfamiliar one.
"You could reach ONE rail-lock easily from the front of the system," explained Dan. "Reaching both of them, however, you ended up hugging this massive and heavy box like a mother bear, flipping both locks and then starting the system back on its way into the rack with a little judicious hip pressure.
"Everyone on my team had done it dozens of times.
"This time, Murphy was watching and arranged for my belt buckle to occupy the exact same piece of the universe as the tiny, *unshielded* master power switch on the top panel at the front of the box...
"There was a click and microseconds later the pager on my belt went nuts as the only not-yet-redundant component of our entire business took a hard power outage."
Yes, Dan's lunch turned out to be very, very late that day. Still, the budget needed to add that last bit of redundancy arrived soon after.
Ever accidentally demonstrated just how stable (or not) your company's systems truly were? Or dropped some spectacles into the whirring blades of a PSU fan? Share your totally SFW tales of clothing or body parts causing IT chaos in an email to Who, Me? ®