AWS celebrates Labor Day weekend by roasting customer data in US-East-1 BBQ

Postmortem report: Power outage knackered instances, volumes for unlucky punters


A power outage fried hardware within one of Amazon Web Services' data centers during America's Labor Day weekend, causing some customer data to be lost.

When the power went out, and backup generators subsequently failed, some virtual server instances evaporated – and some cloud-hosted volumes were destroyed and had to be restored from backups, where possible, we're told.

A Register reader today tipped us off that on Saturday morning, Amazon's cloud biz started suffering a breakdown within its US-East-1 region.

Our tipster told us they had more than 1TB of data in Amazon's cloud-hosted Elastic Block Store (EBS), which disappeared during the outage: they were told "the underlying hardware related to your EBS volume has failed, and the data associated with the volume is unrecoverable."

Our reader, who asked to remain anonymous, was able to restore their data by hand from an EBS snapshot conveniently taken roughly eight hours earlier. Without this backup, they might not have been able to recover any of the lost information: Amazon's engineers were able to resuscitate the vast majority of downed systems, though not every storage volume survived the hard crash.
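For anyone in a similar position, the usual recovery path is to carve a fresh volume out of the most recent snapshot and attach it to a healthy instance. Below is a minimal sketch using boto3, the AWS SDK for Python; the snapshot ID, availability zone, instance ID, and device name are hypothetical placeholders, not details from this incident.

    # Minimal sketch: recreate an EBS volume from a snapshot and attach it.
    # All identifiers below are hypothetical placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Create a new volume from the last good snapshot, in the same
    # availability zone as the instance that will mount it.
    resp = ec2.create_volume(
        SnapshotId="snap-0123456789abcdef0",   # placeholder snapshot ID
        AvailabilityZone="us-east-1a",         # must match the target instance's AZ
        VolumeType="gp3",
    )
    volume_id = resp["VolumeId"]

    # Wait for the new volume to become available before attaching it.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Attach the restored volume to a surviving instance.
    ec2.attach_volume(
        VolumeId=volume_id,
        InstanceId="i-0123456789abcdef0",      # placeholder instance ID
        Device="/dev/sdf",
    )
    print(f"Restored volume {volume_id} attached")

A snapshot only captures the volume as it stood when it was taken, of course, so anything written in the hours since – in our reader's case, up to roughly eight hours of changes – stays lost.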

Unlucky customers who had data on the zapped storage systems were told by AWS staff that, despite attempts to revive the missing bits and bytes, some of the ones and zeroes were permanently scrambled: "A small number of volumes were hosted on hardware which was adversely affected by the loss of power. However, due to the damage from the power event, the EBS servers underlying these volumes have not recovered.

"After further attempts to recover these volumes, they were determined to be unrecoverable."

Meanwhile, one customer and tech consultant, Andy Hunt, not only complained on Twitter that their data was trashed in the power cut, but also claimed the cause of the failure wasn't swiftly communicated to subscribers: "AWS had a power failure, their backup generators failed, which killed their EBS servers, which took all of our data with it. Then it took them four days to figure this out and tell us about it.

"Reminder: The cloud is just a computer in Reston with a bad power supply."

A spokesperson for AWS was not available for comment.

'Impaired'

Although some details about the downtime were published, albeit buried, on AWS's status page, El Reg has seen a more detailed series of notices sent to customers explaining the blunder.

At just before 1100 PDT that day, AWS noted that, at about 0430 PDT, "one of ten data centers in one of the six Availability Zones in the US-East-1 Region saw a failure of utility power. Backup generators came online immediately, but for reasons we are still investigating, began quickly failing at around 0600 PDT."

"This resulted in 7.5 per cent of all instances in that Availability Zone failing by 0610 PDT," it continued. "Over the last few hours we have recovered most instances but still have 1.5 per cent of the instances in that Availability Zone remaining to be recovered. Similar impact existed to EBS and we continue to recover volumes within EBS. New instance launches in this zone continue to work without issue."

Roughly two and a half hours later, at 1330 PDT, the cloud goliath clarified and expanded its note as follows:

At 0433 PDT one of ten data centers in one of the six Availability Zones in the US-East-1 Region saw a failure of utility power. Our backup generators came online immediately but began failing at around 0600 PDT. This impacted 7.5 per cent of EC2 instances and EBS volumes in the Availability Zone.

Power was fully restored to the impacted data center at 0745 PDT. By 1045 PDT, all but one per cent of instances had been recovered, and by 1230 PDT only 0.5 per cent of instances remained impaired. Since the beginning of the impact, we have been working to recover the remaining instances and volumes. A small number of remaining instances and volumes are hosted on hardware which was adversely affected by the loss of power. We continue to work to recover all affected instances and volumes and will be communicating to the remaining impacted customers via the Personal Health Dashboard. For immediate recovery, we recommend replacing any remaining affected instances or volumes if possible.
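For subscribers unsure whether their own storage was among the casualties, EBS volume status checks can be polled directly rather than waiting on the Personal Health Dashboard. The following is a rough sketch using boto3; the region is an assumption, and no identifiers from the incident are involved.

    # Sketch: flag EBS volumes whose status checks are not "ok".
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # DescribeVolumeStatus reports "ok", "impaired", or "insufficient-data"
    # for each volume's status checks.
    paginator = ec2.get_paginator("describe_volume_status")
    for page in paginator.paginate():
        for vol in page["VolumeStatuses"]:
            status = vol["VolumeStatus"]["Status"]
            if status != "ok":
                print(f"{vol['VolumeId']} in {vol['AvailabilityZone']}: {status}")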

So, in effect, according to Amazon, early on Saturday morning, US West Coast time, an AWS data center lost power, then an hour and a half later, the backup generators failed, taking down 7.5 per cent of the EC2 virtual machines and EBS volumes in that availability zone.

A few hours later, 99.5 per cent of affected systems had been recovered, and of those still "impaired," some were unrecoverable, forcing subscribers to pull out their backups – assuming they kept them. ®
