Amazon S3-izure cause: Half the web vanished because an AWS bod fat-fingered a command
Basically, Team Bezos pulled a GitLab
Amazon has provided the postmortem for Tuesday's AWS S3 meltdown, shedding light on what caused one of its largest cloud facilities to bring a chunk of the web down.
In a note today to customers, the tech giant said the storage system was knocked offline by a staffer trying to address a problem with its billing system. Essentially, someone mistyped a command within a production environment while debugging a performance gremlin.
"The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process," the team wrote in its message.
"Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems."
Those two subsystems handled the indexing of objects stored in S3 and the allocation of new storage. With both down, Amazon said it was unable to handle any customer requests to S3 itself, nor requests from dependent services – such as EC2 and Lambda functions – that relied on S3.
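Amazon hasn't published the actual command or its inputs, but the failure mode it describes – one mistyped argument selecting far more servers than intended – is easy to illustrate. The following is a minimal hypothetical sketch; the `servers_matching` helper, hostnames, and fleet sizes are all invented, not AWS's real tooling:

```python
def servers_matching(fleet: list[str], pattern: str) -> list[str]:
    """Select the hosts whose name starts with the given pattern."""
    return [host for host in fleet if host.startswith(pattern)]

# A toy fleet: the billing hosts plus the index and placement subsystems
# that, in the real outage, shared the servers that got removed.
fleet = ([f"s3-billing-{i:02d}" for i in range(4)]
         + [f"s3-index-{i:02d}" for i in range(4)]
         + [f"s3-placement-{i:02d}" for i in range(4)])

# Intended: take a handful of billing hosts out of service.
assert len(servers_matching(fleet, "s3-billing")) == 4

# Fat-fingered: a truncated pattern silently selects the entire fleet,
# index and placement subsystems included.
assert len(servers_matching(fleet, "s3-")) == 12
```

Nothing in this toy selector distinguishes "a small number of servers for one subsystem" from "everything" – which is precisely the gap the safeguards described later are meant to close.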
As a result, websites small and large that relied on the cheap and popular US-East-1 region in Virginia stopped working properly, costing customers hundreds of millions of dollars. The outage also broke smartphone apps and Internet of Things gadgets – from lightbulbs to Nest security cameras – that relied on S3 as their storage backend.
Among the collateral damage from the outage was the AWS service health dashboard, which relies on S3 for some of its data; with S3 offline, staff could not update it for about two hours. AWS says it was able to restore full S3 service and operations by 1:54PM PST, nearly four and a half hours after the errant command.
"While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years," the AWS team explained on Thursday.
"S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected."
Amazon says it will put several safeguards in place to prevent similar outages, including limiting how much capacity its debugging tools can take offline at once and partitioning the entire service into smaller "cells" that can be taken offline and updated individually without affecting other parts of S3.
"We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level," AWS said of the offending software.
"This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks."
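The two safeguards AWS describes – removing capacity more slowly, and refusing outright any removal that would drop a subsystem below its minimum required capacity – can be sketched as a guard around the removal tool. This is a hypothetical illustration only; the function name, batch size, and capacity figures are invented:

```python
class CapacityGuardError(Exception):
    """Raised when a removal would breach a subsystem's capacity floor."""


def remove_capacity(active_servers: int, to_remove: int,
                    min_capacity: int, max_batch: int = 2) -> int:
    """Remove servers gradually; return the new active count.

    Refuses the whole request up front if it would take the
    subsystem below its minimum required capacity.
    """
    if active_servers - to_remove < min_capacity:
        raise CapacityGuardError(
            f"refusing: removing {to_remove} servers would drop capacity "
            f"below the minimum of {min_capacity}")
    removed = 0
    while removed < to_remove:
        # Remove capacity slowly, in small batches, rather than all at once.
        batch = min(max_batch, to_remove - removed)
        active_servers -= batch
        removed += batch
    return active_servers


# A small, valid removal succeeds...
assert remove_capacity(100, 5, min_capacity=90) == 95

# ...but a fat-fingered "50" is rejected before any server goes offline.
try:
    remove_capacity(100, 50, min_capacity=90)
except CapacityGuardError:
    pass
```

The key design point is that the floor check happens before any server is touched, so a mistyped input fails loudly instead of partially executing.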
Mistyping a command and crippling service for hours. Where have we heard that one before? ®