Resilience is overrated when it's not advertised

Successful failover can sometimes be a failure

Nothing ruins a weekend like failed failover, which is why every Friday The Register brings readers a new instalment of On Call, the column in which we celebrate the readers whose recreation is ruined by rotten resilience regimes.

This week, meet “Brad,” who once worked for a company that provided criminal justice apps to police departments.

Brad was in support and one of the systems he tended was a Data General server – actually a pair of them, because one was set up to fail over to the other.

"Set up" may be too kind a description. The pair of boxes shared a floating IP – but the wire linking them had never been connected. So while the servers were configured to fail over, they were not capable of doing so.

"This was the very early days of failover, and it had never been tested," Brad explained. "The principle was there, but we had plenty of testing to do before we actually went live with failover."

Fair enough, then.

While the servers were in that state of not-quite readiness, Brad was on call and took the inevitable 1:00 AM phone call: a client complaining about slow performance.

Brad's first tactic was to stay in bed for a bit – he wasn't allowed to remote into these servers and hoped the problem would go away by itself.

It didn't. So he then faced an hour-long drive to his office, from where remote access was permissible and possible.

After telnetting into the server, he found it was redlining.

"It was going bonkers: maxed out on CPU, maxed out on memory, pretty much just seized up," Brad told On Call.

No obvious problem was apparent to explain the server's condition, so Brad ended up rebooting it – even as the constabulary who were his clients expressed their displeasure at not being able to do little things like process the evening's catch of malfeasants.

Brad labored mightily through the night, and into the dawn, without being able to find a fix.

Then one of his colleagues arrived to work at something approaching normal business hours, and spotted that the server resources were way lower than they should have been.

"He checked the physical IP address – not the floating one I was using – and saw what had happened. The previous day, unknown to the rest of us, the senior engineer (who was steadfastly unrepentant) had been on site and had connected the wire between the live server and the failover server."

This story ends with one piece of good news and two pieces of bad news.

The good news was that failover had worked – the live server had a problem and the backup server kicked in as planned, even though it had never been tested!

The first piece of bad news was that the backup box had less than half the memory and CPU of the live server. "It was doing its level best to keep up with demand but just couldn't cope," Brad wrote.

The second was that failover only worked in one direction: from primary to backup. Brad's reboots had been applied to the backup box, which didn't surrender the workload and instead tried to do the job with its inexplicably paltry collection of resources.
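The one-way scheme Brad describes can be sketched in a few lines. The model below is purely illustrative – the class names (`Server`, `Cluster`) and the heartbeat mechanism are hypothetical stand-ins, not anything from Data General's actual setup – but it shows why rebooting the backup box changed nothing: the floating IP only ever moves primary-to-backup on its own, and needs a manual shove to come home.

```python
class Server:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def reboot(self):
        # A reboot clears whatever ailed the box, but in this one-way
        # scheme it does NOT hand the floating IP back to the primary.
        self.healthy = True


class Cluster:
    """Toy model of one-directional failover: the floating IP moves
    primary -> backup automatically, but only a manual force moves it back."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.floating_ip_holder = primary

    def heartbeat(self):
        # Failover fires in one direction only: if the primary holds the
        # floating IP and is unhealthy, the backup takes over.
        if self.floating_ip_holder is self.primary and not self.primary.healthy:
            self.floating_ip_holder = self.backup

    def force_failback(self):
        # What Brad's colleague did: manually push the floating IP
        # back to the primary server.
        self.floating_ip_holder = self.primary


primary = Server("dg-live")
backup = Server("dg-backup")   # half the memory and CPU, per the story
cluster = Cluster(primary, backup)

primary.healthy = False        # the live server seizes up
cluster.heartbeat()
print(cluster.floating_ip_holder.name)  # dg-backup: the backup now answers

backup.reboot()                # Brad's 1 AM reboots hit this box
cluster.heartbeat()
print(cluster.floating_ip_holder.name)  # dg-backup: still no failback

cluster.force_failback()
print(cluster.floating_ip_holder.name)  # dg-live: all suddenly well
```

Anyone telnetting in via the floating IP, as Brad was, sees only whichever box currently holds it – which is exactly why checking the physical address was what cracked the case.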

The fix was easy. Brad's colleague forced the floating IP back over to the primary server, and suddenly all was well.

Brad and his pals later removed the wire connecting the two servers, ensuring failover once again could not occur – even though the client thought it had a resilient rig!

"A couple of years later we moved to Sun servers and this time made sure we tested failover before going live," Brad said.

Which may go some way towards explaining why EMC later acquired a stricken Data General for a reasonable sum in 1999!

Have you ever been confused by tech that worked when it wasn't supposed to? If so, click here to send On Call an email and we may share your story here on a future Friday. On Call has appeared without fail for years, but our resilience is not strong at present – we could use plenty more stories to keep the column at its best. ®
