AWS US East region endures eight-hour wobble thanks to 'Stuck IO' in Elastic Block Store
EC2 instances were impaired, Redshift hurt, and some of you may still struggle to access your data
Amazon Web Services' largest region yesterday experienced an eight-hour disruption with the Elastic Block Store (EBS) service that impacted several notable web sites and services.
The lack of fun started at 8:11pm PDT on Sunday, when EBS experienced "degraded performance" in one availability zone (USE1-AZ2) in the US-EAST-1 Region. A subsequent update described the issue as "Stuck IO" and warned that existing EC2 instances may "experience impairment" while attempts to launch new EC2 instances could fail.
US East is the only AWS Region to offer six availability zones – a reflection of its status as the company's first location.
Other AWS services – among them Redshift, OpenSearch, ElastiCache, and RDS databases – experienced "connectivity issues" as well.
By 9:17pm, AWS felt that the number of EC2 instances impacted by the issue had plateaued, but users continued to experience difficulties.
By 9:47pm the beginnings of an explanation emerged, as AWS revealed "A subsystem within the larger EBS service that is responsible for coordinating storage hosts is currently degraded due to increased resource contention."
Among the organisations impacted were secure messaging app Signal …
Hold tight, folks! Signal is currently down, due to a hosting outage affecting parts of our service. We're working on bringing it back up. – Signal (@signalapp), September 27, 2021
… and The New York Times games site (yes, your correspondent has a Spelling Bee problem).
Hi folks, our Games page is now up and running. Apologies for disturbing your morning routines, but back to your Monday solving! 🧩 – NYTimes Wordplay (@NYTimesWordplay), September 27, 2021
At 10:23pm AWS explained it had made "several changes to address the increased resource contention within the subsystem responsible for coordinating storage hosts with the EBS service". Those changes "led to some improvement", but an 11:19pm update still reported only "some improvements" and admitted "we have not yet seen performance for affected volumes return to normal levels".
A minute later, AWS rolled out a change. By 11:43pm, AWS was confident enough to report the mitigations had worked, and predicted EBS volume performance would return to normal levels within an hour.
But at 1:15am the next day, a glitch struck. Some restored services slowed down again, and some new volumes also experienced "degraded performance".
By 3:36am new EC2 instances were again booting without incident, and at 4:21am the cloud concern updated its status feed with news that full operations had been restored at 3:45am.
But the company also admitted: "While almost all of EBS volumes have fully recovered, we continue to work on recovering a remaining small set of EBS volumes."
"While the majority of affected services have fully recovered, we continue to recover some services, including RDS databases and Elasticache clusters," the final update added.
Clouds. Sometimes it's hard to find the silver lining. ®