A practical demonstration of the difference between 'resilient' and 'redundant'

Bump and grind in the server racks

Who, Me? Monday is upon us, and with it comes a cautionary tale of how one Register reader's overconfidence led to his undoing, thanks to an unexpected interfacing with a belt buckle, in today's edition of Who, Me?

Our story comes from "Dan", a lead system admin at what he described as a "rather large company" where the vast majority of the business went through a single suite of applications backed by a central database.

"The number of zeros on the 'dollars per minute' lost in unscheduled downtime was frightening," he told us.

The company liked Dan. He had rocked up after his predecessor departed under a cloud and had spent quite a while dealing with the environment. He found Development, QA and Production all running with differing patch levels and occasionally even different OS versions. Configurations didn't match. Hardware architectures differed. And so on.

It took a while but, once Dan had lined everything up, downtime due to bugs being found in production dropped significantly. It's fair to say the company was very pleased. Perhaps a bit too pleased.

"There remained, however, one nasty fly in the ointment," he said. "The guys in sales have been promising our clients for years that our systems were resilient and redundant.

"Resilient, they had become. Redundant they weren't. Not in any sense of the word."

However, bit by bit, the company's systems were indeed slowly becoming redundant, aided by Dan "stalking the developer cubes with a bat to encourage the removal, or non-creation, of code that would not play nice with node failover."

The final piece of the redundancy jigsaw was the database. It ran on Sun hardware, and Dan's team could hotswap pretty much any hardware component without the system suffering any downtime. Resilient, for sure. But still not redundant, despite management's joy at the dramatic reduction in outages.

"They thought my team had learned to walk on water or something," recalled Dan, happily.

The next step was to do some hardware duplication and create a cluster capable of withstanding all manner of disaster. "The DBAs were practically salivating at the prospect," Dan recalled, but the numbers involved were large enough to invoke a "steady on, chaps" from management.

On the day in question, Dan received a routine trouble ticket. It looked like either an adapter card or Gigabit Interface Converter (GBIC) had died. No problem – he was already on site and there were plenty of spares in the data centre.

He asked a colleague to pop in a change request for a hotswap and headed into the computing sanctum to do the deed. He had just enough time to get it sorted before heading off for lunch.

It transpired it wasn't the GBIC at fault, but the adapter card.

Fine. He'd done this many times. It was a simple case of powering down the system, pulling it out on its rails, replacing the card and firing it back up. Simple.

Or not.

There was a slight wrinkle in the process, but not an unfamiliar one.

"You could reach ONE rail-lock easily from the front of the system," explained Dan. "Reaching both of them, however, you ended up hugging this massive and heavy box like a mother bear, flipping both locks and then starting the system back on its way into the rack with a little judicious hip pressure.

"Everyone on my team had done it dozens of times.

"This time, Murphy was watching and arranged for my belt buckle to occupy the exact same piece of the universe as the tiny, *unshielded* master power switch on the top panel at the front of the box...

"There was a click and microseconds later the pager on my belt went nuts as the only not-yet-redundant component of our entire business took a hard power outage."

Yes, Dan's lunch turned out to be very, very late that day. Still, the budget needed to add that last bit of redundancy arrived soon after.

Ever accidentally demonstrated just how stable (or not) your company's systems truly were? Or dropped some spectacles into the whirring blades of a PSU fan? Share your totally SFW tales of clothing or body parts causing IT chaos in an email to Who, Me? ®

Biting the hand that feeds IT © 1998–2022