On-Call Welcome again to On-Call, our regular Friday column in which we ease you into the weekend with readers' tales of the ridiculous things they've been asked to do on evenings, weekends and the other odd times techie folk are called out to fix things up.
This week's contribution comes from reader “JF”, who tells us that “About eight years back, I was co-ordinating a data center build-out.”
The existing facility sported a trio of 220 kVA UPSes, but at the new bit barn it was decided to run with just two.
“The decision was made to save costs by relocating one of them to the new building rather than buying all new equipment.”
JF says he “begged the business to call a complete shutdown to remove the UPS. They asked me what the odds were of something going wrong, and I made the error of trying to provide an accurate estimate of the risk by saying there was about a one in 100 chance of problems.”
JF thought a one per cent risk of power failure across 25,000 square feet packed full of server racks, live, in production, would scare off the bean counters.
He was wrong.
“The business figured that was a perfectly acceptable risk,” he recalls. So JF decided to come in on the Saturday of the move, just in case.
“I was sitting at a desk in the middle of the data center floor that weekend when the electricians began the delicate work of removing a 220 kVA UPS unit from the mains,” JF wrote. “They put the system in bypass mode without a problem. They then cut the output breakers for the units to be removed. No problems. Then they wanted to isolate the inputs for the UPS units.”
Bad idea, because “they cut the input breaker for the master electrical panel, not the outputs that went to the UPS systems. That master panel also supplied the circuits to the bypass feed, which meant we had no power to anything at all.”
“25,000 square feet suddenly went silent. I ran into the electrical room expecting to find bits of dead electrician all over the place, but they were just calmly disconnecting wires.”
“I yelled, 'We're down!' and they said, 'No, we're in bypass mode.' I repeated, 'Noooo. We are down.' They paused for 10 seconds and then their eyes got really wide.”
JF asked how long it would take to fix things up and was told an hour.
“Hurry!” JF exhorted, before calling everyone else in the business who could possibly help, such as folks on the application, sysadmin, network admin, and DBA teams. The business hadn't bothered to put any of them on call, because with just a one per cent chance of failure, why would it need to?
“It took me about an hour to flip the power switches on every piece of equipment, and by then the entire IT department had arrived, glaring death at me,” JF recalls. “It took about 36 hours to get everything back online. About 12 old storage arrays never did come back because they lost four or five disks each. Lots of network equipment had unsaved config files, and it took ages to fix that. We'd leased out a cage to another company, which wasn't pleased either. They hadn't even been informed anything was being done.”
“The first thing I heard from management on Monday morning was: 'I thought you said there was only a one in 100 chance of failure!'”
“I just stared at him,” JF says.
JF handed in his notice about a month later.
If your boss has ignored your scariest advice, or you've experienced other on-call messes, do let me know by writing me an electronic message. ®