Networks

A practical demonstration of the difference between 'resilient' and 'redundant'

Bump and grind in the server racks

Mon 6 Sep 2021 // 07:34 UTC

Who, Me? Monday is upon us, and with it comes a cautionary tale of how one Register reader's overconfidence led to his undoing, thanks to an unexpected interfacing with a belt buckle, in today's edition of Who, Me?

Our story comes from "Dan", a lead system admin at what he described as a "rather large company" where the vast majority of the business went through a single suite of applications backed up by a central database.

"The number of zeros on the 'dollars per minute' lost in unscheduled downtime was frightening," he told us.

The company liked Dan. He had rocked up after his predecessor departed under a cloud and had spent quite a while dealing with the environment. He found Development, QA and Production all running with differing patch levels and occasionally even different OS versions. Configurations didn't match. Hardware architectures differed. And so on.

It took a while but, once Dan had lined everything up, downtime due to bugs being found in production dropped significantly. It's fair to say the company was very pleased. Perhaps a bit too pleased.

"There remained, however, one nasty fly in the ointment," he said. "The guys in sales have been promising our clients for years that our systems were resilient and redundant.

"Resilient, they had become. Redundant they weren't. Not in any sense of the word."

However, bit by bit, the company's systems were indeed slowly becoming redundant, aided by Dan "stalking the developer cubes with a bat to encourage the removal, or non-creation, of code that would not play nice with node failover."

The final piece of the redundancy jigsaw was the database. It ran on Sun hardware, and Dan's team could hotswap pretty much any hardware component without the system suffering any downtime. Resilient, for sure. But still not redundant despite the joy from management at the dramatic reduction in outages.

"They thought my team had learned to walk on water or something," recalled Dan, happily.

The next step was to do some hardware duplication and create a cluster capable of withstanding all manner of disaster. "The DBAs were practically salivating at the prospect," Dan recalled, but the numbers involved were large enough to invoke a "steady on, chaps" from management.

On the day in question, Dan received a routine trouble ticket. It looked like either an adapter card or Gigabit Interface Converter (GBIC) had died. No problem – he was already on site and there were plenty of spares in the data centre.

He asked a colleague to pop in a change request for a hotswap and headed into the computing sanctum to do the deed. He had just enough time to get it sorted before heading off for lunch.

It transpired it wasn't the GBIC at fault, but the adapter card.

Fine. He'd done this many times. It was a simple case of powering down the system, pulling it out on its rails, replacing the card and firing it back up. Simple.

Or not.

There was a slight wrinkle in the process, but not an unfamiliar one.

"You could reach ONE rail-lock easily from the front of the system," explained Dan. "Reaching both of them, however, you ended up hugging this massive and heavy box like a mother bear, flipping both locks and then starting the system back on its way into the rack with a little judicious hip pressure.

"Everyone on my team had done it dozens of times.

"This time, Murphy was watching and arranged for my belt buckle to occupy the exact same piece of the universe as the tiny, *unshielded* master power switch on the top panel at the front of the box...

"There was a click and microseconds later the pager on my belt went nuts as the only not-yet-redundant component of our entire business took a hard power outage."

Yes, Dan's lunch turned out to be very, very late that day. Still, the budget needed to add that last bit of redundancy arrived soon after.

Ever accidentally demonstrated just how stable (or not) your company's systems truly were? Or dropped some spectacles into the whirring blades of a PSU fan? Share your totally SFW tales of clothing or body parts causing IT chaos in an email to Who, Me? ®
