AWS Frankfurt experiences major breakdown that staff couldn’t fix for hours due to ‘environmental conditions’ on data centre floor

Cloud colossus says aircon fail caused kit to shut down, networks dropped out and EC2 instances went dark

Updated A single availability zone in Amazon Web Services’ EU-Central region (EUC_AZ-1) has experienced a major outage.

The internet giant's status page says the breakdown began at 1324 PDT (2024 UTC) on June 10, and initially caused “connectivity issues for some EC2 instances.”

Half an hour later AWS reported “increased API error rates and latencies for the EC2 APIs and connectivity issues for instances … caused by an increase in ambient temperature within a subsection of the affected Availability Zone.”

By 1436 PDT, AWS said temperatures were falling but network connectivity remained down. An hour later, the cloud colossus offered the following rather unsettling assessment:

While temperatures continue to return to normal levels, engineers are still not able to enter the affected part of the Availability Zone. We believe that the environment will be safe for re-entry within the next 30 minutes, but are working on recovery remotely at this stage.

A 1612 PDT update reported that staff were still unable to enter the site for safety reasons.

At 1633, network services were restored, an event AWS said should lead to swift resumption of EC2 instances. A 1719 update stated “environmental conditions within the affected Availability Zone have now returned to normal level,” and advised users that “the vast majority of affected EC2 instances have now fully recovered but we’re continuing to work through some EBS volumes that continue to experience degraded performance.”

Kinesis Data Streams, Kinesis Firehose, Amazon Relational Database Service, and AWS CloudFormation also wobbled.

AWS's most recent status update concluded: “We will provide further details on the root cause in a subsequent post, but can confirm that there was no fire within the facility.”

Which leaves the question: just what made the data centre too dangerous to enter?

While we lack any evidence on which to base an assertion, The Register has reported on erupting UPSes and tiny puffs of smoke leading to hypoxic gas being released into data centres.

The whole point of hypoxic gas release into data centres is to deprive fires of oxygen. And as humans need oxygen, it can be a while before engineers can return to a data centre.

The Register mentions this because it fits the facts of this incident, and AWS’s language about “environmental conditions” preventing entry.

We will update this story if new information about this incident comes to hand. ®

Updated to add at 0245 UTC, June 11

AWS has updated its incident report (and mostly proven our analysis correct) by revealing that the incident was caused by "failure of a control system which disabled multiple air handlers in the affected Availability Zone."

The air handlers cool the data centre, so once they stopped working "ambient temperatures began to rise" to unsafe levels and AWS servers and networking kit shut down.

"Unfortunately, because this issue impacted several redundant network switches, a larger number of EC2 instances in this single Availability Zone lost network connectivity," the update adds.

And now for the bit that we get to be smug about:

"While our operators would normally have been able to restore cooling before impact, a fire suppression system activated inside a section of the affected Availability Zone. When this system activates, the data center is evacuated and sealed, and a chemical is dispersed to remove oxygen from the air to extinguish any fire."

AWS staff had to wait for the local fire department to arrive and attest that the building was safe. Once that sign-off was secured, AWS says "the building needed to be re-oxygenated before it was safe for engineers to enter the facility and restore the affected networking gear and servers."

Safe working conditions were restored and so was most of the hardware and services. But it appears some kit was damaged, as AWS stated: "A very small number of remaining instances and volumes that were adversely affected by the increased ambient temperatures and loss of power remain unresolved."

The cloud giant also let clients know that the fire suppression system that activated remains disabled for now.
