Netflix, Tinder, Airbnb and other big names were crippled or thrown offline for millions of people when Amazon suffered what's now revealed to be a cascade of cock-ups.
On Sunday, Amazon Web Services (AWS), which powers a good chunk of the internet, broke down and cut off websites from people eager to stream TV, or hook up with strangers; thousands complained they couldn't watch Netflix, chat up potential partners, find a place to crash via Airbnb, memorize trivia on IMDb, and so on.
Today, it's emerged the mega-outage was caused by vital systems in one part of AWS taking too long to hand over information that another part needed in order to serve customers.
Picture a steakhouse in which the cooks are taking so long to prepare the food, the side dishes have gone cold by the time the waiters and waitresses take the plates from the chef to the hungry diners. The orders have to be started again from scratch, the whole operation is overwhelmed, the chef walks out, and ultimately customers aren't getting fed. A typical Gordon Ramsay kitchen nightmare.
In technical terms, the internal metadata servers in AWS's DynamoDB database service were not answering queries from the storage systems within a particular time limit.
DynamoDB tables can be split into partitions scattered over many servers. These partitions are grouped into memberships, the details of which are stored in the metadata servers. The DynamoDB storage systems routinely poke the metadata service to make sure their membership records are up to date – that they are pulling the right partitions and table data from the right servers, in other words.
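To make the moving parts concrete, here's a minimal sketch of that polling relationship. All the names, timings, and data structures below are invented for illustration; they are not AWS's actual internals:

```python
import time

METADATA_TIMEOUT = 1.0  # seconds a storage node will wait for an answer (illustrative)

class MetadataService:
    """Holds which partitions belong to which table -- the 'membership' records."""
    def __init__(self):
        self.memberships = {"orders": ["partition-1", "partition-2"]}

    def get_membership(self, table):
        return list(self.memberships[table])

class StorageNode:
    """Routinely refreshes its local copy of the partition membership."""
    def __init__(self, metadata):
        self.metadata = metadata
        self.membership = {}

    def refresh(self, table):
        start = time.monotonic()
        answer = self.metadata.get_membership(table)
        if time.monotonic() - start > METADATA_TIMEOUT:
            # Too slow: the node can't trust its view and must retry
            raise TimeoutError("metadata service too slow; retrying")
        self.membership[table] = answer
        return answer

node = StorageNode(MetadataService())
print(node.refresh("orders"))  # ['partition-1', 'partition-2']
```

The important detail is that deadline: a storage node that doesn't get an answer in time treats the query as failed and asks again, which is the seed of the trouble described next.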
At about 0220 PT on Sunday, the metadata service was taking too long to send back answers to the storage servers. This was apparently due to AWS customers using Global Secondary Indexes in their databases, which increase the size of the partition membership information for a table: the internal metadata systems were struggling to take all this extra data, package it up, and pipe it to the storage systems.
At that moment on Sunday, the levee broke: too many taxing requests hit the metadata servers simultaneously, causing them to slow down and not respond to the storage systems in time. This forced the storage systems to stop handling requests for data from customers, and instead retry their membership queries to the metadata service – putting further strain on the cloud.
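That feedback loop – slow answers miss the deadline, which triggers retries, which add load, which slows the answers further – can be shown with a toy simulation. The numbers are made up; the point is the shape of the failure:

```python
CAPACITY = 100   # membership queries the metadata service can answer per tick (illustrative)
NODES = 150      # storage nodes all due for a membership refresh

def simulate(ticks, strict_deadline=True):
    """Return the load on the metadata service at each tick."""
    waiting = NODES
    history = []
    for _ in range(ticks):
        history.append(waiting)
        if strict_deadline and waiting > CAPACITY:
            served = 0  # overloaded: every answer misses the deadline, all nodes retry
        else:
            served = min(waiting, CAPACITY)  # answers arrive in time and count
        waiting -= served  # unanswered nodes come straight back next tick
    return history

print(simulate(5))                        # pinned at overload: [150, 150, 150, 150, 150]
print(simulate(5, strict_deadline=False)) # backlog drains: [150, 50, 0, 0, 0]
```

Once load exceeds capacity, the strict deadline means nobody gets a usable answer, so the queue never shrinks – which is why, as AWS notes below, simply lengthening the time allowed for queries is one of the fixes.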
It got so bad AWS engineers were unable to send administrative commands to the metadata systems. At about 0500 PT, the team paused the service so that it could get in and make changes to handle the overwhelming workload. By 0700 PT, DynamoDB was staggering back to its feet.
Other services were hit by the outage, too: EC2 Auto Scaling, the Simple Queue Service, CloudWatch, and the AWS Console all suffered problems.
This is all according to a note from Amazon engineers, who described the downtime in full detail here if you're interested. The affected systems were in AWS's US-East region. In the team's own words:
With the metadata service under heavy load, it also no longer responded as quickly to storage servers uninvolved in the original network disruption, who were checking their membership data in the normal cadence of when they retrieve this information. Many of those storage servers also became unavailable for handling customer requests. Unavailable servers continued to retry requests for membership data, maintaining high load on the metadata service.
"Initially, we were unable to add capacity to the metadata service because it was under such high load, preventing us from successfully making the requisite administrative requests," the AWS bods continued. "Once adjustments were made, we were able to reactivate requests to the metadata service, put storage servers back into the customer request path, and allow normal load back on the metadata service."
For those relying on Amazon's cloud, here's the crucial part – how the team will stop the great crash from happening again:
Firstly, we have already significantly increased the capacity of the metadata service. Second, we are instrumenting stricter monitoring on performance dimensions, such as the membership size, to allow us to thoroughly understand their state and proactively plan for the right capacity. Third, we are reducing the rate at which storage nodes request membership data and lengthening the time allowed to process queries. Finally and longer term, we are segmenting the DynamoDB service so that it will have many instances of the metadata service each serving only portions of the storage server fleet. This will further contain the impact of software, performance/capacity, or infrastructure failures.
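The longer-term segmentation fix amounts to sharding the metadata service so that each instance fronts only a slice of the storage fleet, containing any future overload. Here's one hedged way such a mapping could look – the hashing scheme and instance count are our invention, not Amazon's design:

```python
import hashlib

NUM_METADATA_INSTANCES = 4  # hypothetical number of independent metadata instances

def metadata_instance_for(storage_node_id: str) -> int:
    """Deterministically map a storage node to one metadata instance."""
    digest = hashlib.sha256(storage_node_id.encode()).hexdigest()
    return int(digest, 16) % NUM_METADATA_INSTANCES

# Each instance now serves only a portion of the fleet, so an overload or
# failure in one instance leaves the nodes mapped to the others unaffected.
fleet = [f"storage-node-{i}" for i in range(1000)]
assignments = [metadata_instance_for(node) for node in fleet]
```

The win is blast-radius reduction: a repeat of Sunday's retry storm would hammer one metadata instance and its slice of storage servers, rather than the whole lot.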
"We apologize for the impact to affected customers," the web giant concluded.
It's happening again
As your humble hack hammers away at the keyboard, the Amazon DynamoDB service in the US-East-1 region is suffering from "increased error rates", which started at 0633 PT today. Hours into the disruption, the team is battling to improve the situation.
At 1204 PT, the AWS gang admitted: "One of our mitigations has increased error rates and latencies for some tables. We are actively working to resolve these." Absolutely – wouldn't want anyone to miss out on Netflix'n'chill again. ®