Uptime guarantees don't apply when you turn a machine off, then on again, to 'fix' it
The chap who took the support call for the SEV-1 incident survived – just
On Call With another working week almost behind us, The Register has found another tale to tell in our On-Call column – the home of reader-contributed stories of thankless tech support tasks.
This week meet "Rod" who, in the late 1990s, found himself in a new country, seeking a new job.
Rod had dabbled with computers in the past without ever working in the trade. But his dabbling was at a high level: he built a Linux PC back when Slackware came on floppy disks.
That skill saw Rod offered a position at a small, remote outpost of a major PC-maker's helpdesk.
Rod was good at the job and quite liked it. He was unusual in that regard, as attrition was very high. The vendor paying meagre wages – when it remembered to pay them at all – had a bit to do with the rate of staff turnover.
He stuck around, and when the vendor started to move out of mere PCs and into proper tech like servers and storage, he was asked to join the support team.
"I was trained with the first group of support agents, then attrition made me last man standing of the cohort," Rod told On-Call. "Next, I kind of automatically became one of the trainers, and soon the only trainer for servers and SANs."
"With these promotions, my salary quickly rose to nearly double the company entry level; and with attrition still rife, it looked like I was set for life – not moving anywhere soon."
A few years later, the company introduced a flagship storage product that was bigger, better, faster – and sold with a 99.999 percent uptime guarantee. The service plan bundled with the product promised that any failure would be met by the instant intervention of a crack support squad that could recite the machine's source code backwards while wrestling a crocodile underwater.
By then, Rod had trained many fellow support agents on the product. But – staff attrition being what it was – he was the only person on staff in his region who had spent a lot of time with and really understood the storage box.
Which is why the first SEV-1 call to Rod's team did not go down well.
"I had just gotten up from my desk to wander over to the training room across a very large open plan office space, when I noticed something was amiss," Rod told On-Call. "A person had stood up, a second followed, a third, and more, all facing, then pointing to the same unfortunate support agent, who was looking very pale, possibly sick. Before I reached the desk whispers had reached me 'A customer's storage system is in trouble. And it's one of those systems we promise will never go down!'"
Rod's team all but begged him to take the call. So he commandeered a PC, donned a headset, and started talking to the customer.
"You appear to have a major problem at hand, sorry to hear, can you tell me happened?" Rod opened, using a tactic of displaying empathy for the customer.
"Yes, we rebooted the system, and it complains it has lost all configuration and data," was the reply.
Which left Rod supporting a device that was sold with an uptime guarantee, for a customer who had deliberately caused downtime!
Or was it the customer? Training tech support people had taught Rod it is better to ask why something happened than who made it happen.
Rod therefore asked why the machine had been turned off, and was told an outsourced chap did the deed in order to solve a server problem.
Rod explained that machines designed to achieve 99.999 percent uptime were not designed to be turned off and turned on again. Indeed, such boxes are not really designed to be turned off at all – their reboot processes can be rather elaborate.
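For a sense of how little slack "five nines" leaves, here is a back-of-the-envelope sketch. It assumes the guarantee is measured over a 365-day year, which Rod's tale does not specify, and the function name is ours for illustration rather than anything the vendor used.

```python
# Minutes in a year, assuming a non-leap 365-day measurement period.
# The actual measurement window of the vendor's guarantee is not stated.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget_minutes(availability_percent: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the most downtime, in minutes, that still meets the target.

    Hypothetical helper: the budget is simply the fraction of the period
    that the availability target allows the system to be unavailable.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_minutes * (100 - availability_percent) / 100


if __name__ == "__main__":
    # Compare three, four and five nines over one year.
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% uptime allows about {budget:.2f} minutes of downtime per year")
```

On those assumptions, 99.999 percent works out to a little over five minutes of downtime a year, so a single unplanned power cycle with an elaborate restart sequence could plausibly consume the whole allowance.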
"The rest was plain sailing," Rod recalled. His employer could not be blamed, and Rod's thinly skilled team was off the hook. He duly guided the customer through the correct startup sequence, restored the storage box and the servers that relied on it, squaring away a SEV-1 incident in minutes.
"I was quite satisfied and proud it all went so smoothly," Rod told On-Call. "The customer who called in was totally happy, appropriately grateful, and relieved to be off the hook. The colleague whose day I had just saved was ecstatic. Even direct managers expressed their appreciation – which they did not do very often – because we had solved our first SEV-1 within minutes!"
But there's a sting in this tale. Rod and his team were informed that the problem was not handled correctly – they had not immediately despatched a crocodile-wrestling engineer before commencing troubleshooting.
Rod's team was, therefore, relieved of storage support duties.
In his missive to On-Call, Rod suggested weird internal politics was to blame for that shift.
"My manager stood by me. I was not hung out to dry for that single 'breach of protocol' and proceeded to deliver training on everything but storage to hordes of new hires, ad infinitum," Rod wrote.
Rod stuck in the job long enough for the wheel to turn: a re-org or two later he was again asked to take on storage support, trained another group of colleagues, and watched them leave with the skills he had imparted on their CVs.
How have you survived SEV-1 incidents? Click here to send On-Call an email so we can share your story. ®