AWS blames 'latent bug' for prolonging Sydney EC2 outage
First the secondary generators failed, then the glitch slowed some service restoration
Amazon Web Services has explained the extended outage its Sydney services suffered last weekend, attributing downtime to a combination of power problems and a “latent bug in our instance management software”.
Sydney recorded over 150 mm of rain last weekend. On Sunday the 5th alone the city copped 93 mm, plus winds gusting to 96 km/h.
Amazon says that bad weather meant that “At 10:25 PM PDT on June 4th [mid-afternoon Sunday in Sydney – Ed], our utility provider suffered a loss of power at a regional substation as a result of severe weather in the area. This failure resulted in a total loss of utility power to multiple AWS facilities.”
AWS has two backup power systems, but for some instances both backups failed on the night in question.
The cloud colossus' explanation says its backups employ a “diesel rotary uninterruptable power supply (DRUPS), which integrates a diesel generator and a mechanical UPS.”
“Under normal operation, the DRUPS uses utility power to spin a flywheel which stores energy. If utility power is interrupted, the DRUPS uses this stored energy to continue to provide power to the datacenter while the integrated generator is turned on to continue to provide power until utility power is restored.”
Last weekend, however, “a set of breakers responsible for isolating the DRUPS from utility power failed to open quickly enough.” That was bad because these breakers should “assure that the DRUPS reserve power is used to support the datacenter load during the transition to generator power.”
“Instead, the DRUPS system’s energy reserve quickly drained into the degraded power grid.”
That failure meant the diesels couldn't send any juice to the data centre, which promptly fell over.
AWS techs got things running again at 11:46 PM PDT and by 1:00 AM PDT on the 5th, “over 80% of the impacted customer instances and volumes were back online and operational.” Some workloads were slower to recover, thanks to what AWS calls “DNS resolution failures as the internal DNS hosts for that Availability Zone were brought back online and handled the recovery load.”
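For readers wondering what a client can do while DNS hosts are swamped by recovery load, one common answer is to retry lookups with capped exponential backoff and jitter, so that retries do not pile onto the struggling resolvers. The snippet below is a minimal sketch of that general technique, not anything AWS has described: the function name `resolve_with_backoff` and its parameters are our own invention, and the default delays are arbitrary.

```python
import random
import socket
import time


def resolve_with_backoff(hostname, attempts=5, base_delay=0.5, max_delay=8.0,
                         resolver=socket.gethostbyname, sleep=time.sleep):
    """Resolve hostname, retrying failed lookups with capped backoff and jitter.

    Hypothetical helper for illustration only. `resolver` and `sleep` are
    injectable so the retry logic can be exercised without a network.
    """
    for attempt in range(attempts):
        try:
            return resolver(hostname)
        except OSError:  # socket.gaierror is a subclass of OSError
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last failure
            # Double the ceiling on each attempt, capped at max_delay,
            # then pick a random wait below it ("full jitter").
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

The jitter matters as much as the backoff: thousands of instances retrying on the same schedule would simply hit the recovering DNS hosts in synchronised waves.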
But some instances didn't come back. AWS now says that was due to “a latent bug in our instance management software” that meant some instances needed to be restored manually. AWS hasn't explained the nature of that bug.
Other instances were impacted by dead disks that meant data was not immediately available. Manual work was required to restore data.
As is always the case after such messes, AWS has promised to harden the designs that failed.
“While we have experienced excellent operational performance from the power configuration used in this facility,” the mea culpa says, “it is apparent that we need to enhance this particular design to prevent similar power sags from affecting our power delivery infrastructure.”
More breakers are the order of the day, “to assure that we more quickly break connections to degraded utility power to allow our generators to activate before the UPS systems are depleted.”
Software improvements are also planned, including “changes that will assure our APIs are even more resilient to failure” so that those using multiple AWS regions can rely on failover between bit barns.
Those changes should land in the Sydney region in July.
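Failover between regions still has to be wired up on the customer side. As a rough illustration of the idea, and assuming nothing about how AWS's own APIs or SDKs behave, the sketch below tries an operation against an ordered list of regions and returns the first success. The helper `call_with_failover` is hypothetical; the caller supplies whatever `operation` actually talks to the cloud.

```python
def call_with_failover(regions, operation):
    """Run operation(region) against each region in order.

    Returns (region, result) for the first region that succeeds.
    Hypothetical helper for illustration; a real client would catch
    narrower exception types and add timeouts and health checks.
    """
    errors = {}
    for region in regions:
        try:
            return region, operation(region)
        except Exception as exc:  # broad on purpose in this sketch
            errors[region] = exc  # remember why this region was skipped
    raise RuntimeError(f"all regions failed: {errors}")
```

A caller might pass `["ap-southeast-2", "ap-southeast-1"]` to prefer Sydney and fall back to Singapore. Note that this only helps if the data and services the operation needs already exist in the second region.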
AWS is far from alone in suffering physical or software problems with its cloud. Salesforce has also had strife with circuit breakers. Google broke its own cloud with a bug and lost data after a lightning strike.
Clouds, eh? Reliable until they're not! ®