Updated A single availability zone in Amazon Web Services’ EU-Central region (EUC_AZ-1) has experienced a major outage.
The internet giant's status page says the breakdown began at 1324 PDT (2024 UTC) on June 10, and initially caused “connectivity issues for some EC2 instances.”
Half an hour later AWS reported “increased API error rates and latencies for the EC2 APIs and connectivity issues for instances … caused by an increase in ambient temperature within a subsection of the affected Availability Zone.”
By 1436 PDT, AWS said temperatures were falling but that network connectivity remained down. But an hour later, the cloud colossus offered the following rather unsettling assessment:
“While temperatures continue to return to normal levels, engineers are still not able to enter the affected part of the Availability Zone. We believe that the environment will be safe for re-entry within the next 30 minutes, but are working on recovery remotely at this stage.”
A 1612 PDT update reported that staff were still unable to enter the site for safety reasons.
At 1633, network services were restored, an event AWS said should lead to swift resumption of EC2 instances. A 1719 update stated “environmental conditions within the affected Availability Zone have now returned to normal level,” and advised users that “the vast majority of affected EC2 instances have now fully recovered but we’re continuing to work through some EBS volumes that continue to experience degraded performance.”
Kinesis Data Streams, Kinesis Firehose, Amazon Relational Database Service, and AWS CloudFormation also wobbled.
AWS's most recent status update concluded: “We will provide further details on the root cause in a subsequent post, but can confirm that there was no fire within the facility.”
Which leaves the question of just what made the data centre too dangerous to enter?
One possible answer is a fire-suppression system. The whole point of releasing hypoxic gas into a data centre is to deprive fires of oxygen, and as humans need oxygen too, it can be a while before engineers can return.
The Register mentions this as it fits both the facts offered in this incident and AWS’s language about “environmental conditions” preventing entry.
We will update this story if new information about this incident comes to hand. ®
Updated to add at 0245 UTC, June 11
AWS has updated its incident report (and mostly proven our analysis correct) by revealing that the incident was caused by "failure of a control system which disabled multiple air handlers in the affected Availability Zone."
The air handlers cool the data centre, so once they stopped working "ambient temperatures began to rise" to unsafe levels and AWS's servers and networking kit shut down.
"Unfortunately, because this issue impacted several redundant network switches, a larger number of EC2 instances in this single Availability Zone lost network connectivity," the update adds.
And now for the bit that we get to be smug about:
"While our operators would normally had been able to restore cooling before impact, a fire suppression system activated inside a section of the affected Availability Zone. When this system activates, the data center is evacuated and sealed, and a chemical is dispersed to remove oxygen from the air to extinguish any fire."
AWS staff had to wait for the local fire department to arrive and attest that the building was safe. Once that sign-off was secured, AWS says "the building needed to be re-oxygenated before it was safe for engineers to enter the facility and restore the affected networking gear and servers."
Safe working conditions were restored, and so were most of the hardware and services. But it appears some kit was damaged, as AWS stated: "A very small number of remaining instances and volumes that were adversely affected by the increased ambient temperatures and loss of power remain unresolved."
The cloud giant also let clients know that the fire suppression system that activated remains disabled.