Updated Cloud operator Joyent went through a major failure on Tuesday when a fat-fingered admin brought down an entire data center's compute assets.
The cloud provider began reporting "transient availability issues" for its US-East-1 data center at around six-thirty in the evening, East Coast time.
"Due to an operator error, all compute nodes in us-east-1 were simultaneously rebooted," Joyent wrote. "Some compute nodes are already back up, but due to very high load on the control plane, this is taking some time. We are dedicating all operational and engineering resources to getting this issue resolved, and will be providing a full postmortem on this failure once every compute node and customer VM is online and operational. We will be providing frequent updates until the issue is resolved."
The problems were mostly fixed an hour or so later.
For those not familiar with the cloud, a forced reboot of every server in a data center is just about the worst thing that can happen to a provider, aside from the deletion of customer data or multiple data centers going down simultaneously.
"While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a datacenter," explained Joyent's chief technology officer Bryan Cantrill in a post to Hacker News. "As soon as we reasonably can, we will be providing a full postmortem of this: how this was architecturally possible, what exactly happened, how the system recovered, and what improvements we are/will be making to both the software and to operational procedures to assure that this doesn't happen in the future."
Joyent has service-level agreements in place that will compensate customers for downtime, we understand.
In going through such a stomach-churning fault, Joyent has joined an illustrious group of service providers that includes Rackspace, Microsoft, Google, and Amazon, all of which have had similarly catastrophic failures.
"Anything that allows you to administer many, many machines will allow you to do this," Cantrill told The Reg in a phone conversation. "There was a silver lining here in the sense it was an opportunity to see how the system behaved. There are lots of ways it could have been much worse."
Joyent will try to learn from the experience and will publish a full post-mortem as well.
As for the fat-fingered administrator? "The operator that made the error is mortified, there is nothing we could do or say for that operator that is going to make it any worse, frankly," Cantrill said.
Nor would Joyent want to, he explained. The goal for the company is to learn from the problem and get better, not to mete out punishment. "You don't teach dolphins with a shock collar," Cantrill said.
Joyent has now published a post-mortem on the incident.
The outage began when an admin, who was using a tool to remotely update the software on some new servers in Joyent's data center, tried to reboot them and accidentally rebooted every server in the facility instead.
"The command to reboot the select set of new systems that needed to be updated was mis-typed, and instead specified all servers in the datacenter," Joyent wrote. "Unfortunately the tool in question does not have enough input validation to prevent this from happening without extra steps/confirmation, and went ahead and issued a reboot command to every server in us-east-1 availability zone without delay."
At this point we imagine the operator emitted a high-pitched "oh dear oh dear oh dear" before watching the inevitable brown-out occur.
Bringing the system back online took such a long time because the rebooted servers all flooded the boot infrastructure with configuration requests, Joyent explained.
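For readers wondering how a stampede like that is usually tamed, here is a minimal sketch of one common mitigation: exponential backoff with random jitter, so that clients retrying a swamped service spread their requests out rather than arriving in lockstep. This is our own illustration, not Joyent's code or its stated fix; the function name `backoff_delays` and its parameters are hypothetical.

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return one retry delay (in seconds) per attempt, using "full jitter".

    Hypothetical helper for illustration only. Each delay is drawn uniformly
    from 0 up to an exponentially growing ceiling, which is itself capped.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles on every attempt but never exceeds `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # Picking a random point below the ceiling de-synchronizes clients
        # that all started retrying at the same moment.
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    # Each freshly rebooted node would compute its own, different schedule.
    for node in ("node-a", "node-b", "node-c"):
        print(node, [round(d, 2) for d in backoff_delays(5)])
```

Because every node draws its own random delays, thousands of machines that rebooted in the same second would not all hit the boot infrastructure in the same second on each retry.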
"First, we will be dramatically improving the tooling that humans (and systems) interact with such that input validation is much more strict and will not allow for all servers, and control plane servers to be rebooted simultaneously," Joyent said. "We want to reiterate our apology for the magnitude of this issue and the impact it caused our customers and their customers. We will be working as diligently as we can, and as expediently as we can, to prevent an issue like this from happening again."
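What might stricter input validation look like? The sketch below is our own guess at the general shape, not Joyent's tooling: a check that refuses a reboot request when it names unknown hosts, touches control plane servers, or covers more than a set fraction of the fleet, unless the operator has explicitly confirmed. The function `validate_reboot_targets`, the `max_fraction` threshold, and the host names are all hypothetical.

```python
def validate_reboot_targets(targets, inventory, control_plane,
                            max_fraction=0.1, confirmed=False):
    """Return the sorted reboot targets, or raise if the request looks unsafe.

    Hypothetical guard for illustration only; the 10 percent default
    threshold is an arbitrary assumption, not a recommendation.
    """
    targets = set(targets)
    inventory = set(inventory)

    if not targets:
        raise ValueError("no reboot targets given")

    unknown = targets - inventory
    if unknown:
        raise ValueError(f"unknown hosts: {sorted(unknown)}")

    # Control plane servers need an explicit extra confirmation.
    if targets & set(control_plane) and not confirmed:
        raise PermissionError("request includes control plane servers")

    # So does any request covering a large slice of the data center.
    if len(targets) > max_fraction * len(inventory) and not confirmed:
        raise PermissionError(
            f"request covers {len(targets)} of {len(inventory)} servers"
        )

    return sorted(targets)


if __name__ == "__main__":
    fleet = [f"cn{i:02d}" for i in range(20)]
    print(validate_reboot_targets(["cn03", "cn01"], fleet, ["cn00"]))
    try:
        validate_reboot_targets(fleet, fleet, ["cn00"])
    except PermissionError as err:
        print("refused:", err)
```

The point of the design is that the dangerous case requires a deliberate second step, so a mistyped selector fails loudly instead of being carried out immediately.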
El Reg would like to commend Joyent for its transparency about the outage and has made one virtual Sorry You Borked A Bit Barn pint available to the operator who caused the error. Interested parties can provide additional pints by selecting the beer icon in the comments below. ®