AWS wobbles in US East region causing widespread outages
'We have identified the root cause and we are actively working towards recovery' - which now appears almost complete
Updated Technical errors with the US-EAST-1 region of Amazon Web Services have caused widespread woes for customers, including difficulty accessing the management console and some other service problems.
The issues appear to be centred on the US-EAST-1 region, which is the oldest AWS region and located in Northern Virginia. This can have a global impact, as AWS noted in its status report:
"This issue is affecting the global console landing page, which is also hosted in US-EAST-1."
Customers may be able to access region-specific consoles, the company said, by going directly to the URL for that region.
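As a rough illustration of that workaround, the sketch below builds a region-specific console address from a region code. The URL pattern used here (the region code as a subdomain of console.aws.amazon.com) and the helper name `regional_console_url` are assumptions made for this example and are not taken from the status report quoted above.

```python
import re

# Assumed pattern for region-specific console URLs; not quoted from AWS's status report.
CONSOLE_URL_TEMPLATE = "https://{region}.console.aws.amazon.com/"

# Loose shape check for region codes such as "us-west-2" or "eu-central-1".
REGION_PATTERN = re.compile(r"^[a-z]{2}(-[a-z]+)+-\d+$")


def regional_console_url(region: str) -> str:
    """Return the (assumed) console URL for a single AWS region code."""
    region = region.strip().lower()
    if not REGION_PATTERN.match(region):
        raise ValueError(f"does not look like a region code: {region!r}")
    return CONSOLE_URL_TEMPLATE.format(region=region)


if __name__ == "__main__":
    # Example: bypass the global landing page by going straight to another region.
    print(regional_console_url("us-west-2"))
```

The shape check only guards against obvious typos; it does not confirm that a region exists or that its console is reachable.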
Within US-EAST-1 though, the affected services are not just the console, but also EC2 (Elastic Compute Cloud), DynamoDB and Amazon Connect. In reality, if EC2 is not working correctly, hundreds of other services can be impacted, since they run on EC2 behind the scenes.
Twitter filled with frustrated customers, as well as suppliers apologising to their customers for the outage. Even vendors of cloud services were impacted, as many of these also run on AWS, such as Elastic Cloud which reported: "We are experiencing issues with capacity scaling related to the elevated errors rates within us-east-1 (AWS N. Virginia) and are monitoring the situation."
Big names believed to be affected include Amazon's own Alexa, Music and Ring, Netflix, Disney, Discourse (which reported problems with "AWS Route 53, one of our DNS providers"), Tinder and Roku.
One developer said "AWS goes down and I spend 2 hour trying to debug why my code is not working," illustrating the extent to which public cloud services are assumed to be up and running.
Although the outage is serious, other regions generally seem to be unaffected, the management console aside. Hyperscale services share a common weakness, though: while resilience is generally very good, inter-dependencies between services mean that failures can cascade.
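To make the cascading effect concrete, here is a minimal sketch that walks a dependency map and lists everything that is directly or indirectly exposed when one service fails. The service names and the dependency map are invented for illustration and do not describe AWS's actual architecture; the function name `impacted_by` is likewise hypothetical.

```python
from collections import deque

# Hypothetical map: each service -> the services it depends on.
# Invented for illustration; this is not AWS's real dependency graph.
DEPENDS_ON = {
    "compute": [],
    "storage": [],
    "database": ["compute"],
    "web-console": ["compute"],
    "contact-centre": ["database"],
}


def impacted_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can look up "who depends on X?".
    dependents: dict[str, set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outwards from the failed service.
    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return seen - {failed}


if __name__ == "__main__":
    print(sorted(impacted_by("compute", DEPENDS_ON)))
    # prints ['contact-centre', 'database', 'web-console']
```

In this toy map, a fault in "compute" reaches "contact-centre" even though the two are not directly linked, while a fault in "storage" touches nothing else.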
AWS in its status report for the console and for EC2 said that "we have identified the root cause and we are actively working towards recovery," giving hope that the outage will not be long-lived. ®
Updated to add
The outage has been very bad news for the RISC-V team, which is currently hosting a virtual summit.
"We are aware and working closely with the technical team to get this resolved, and will update everyone once it is fully functioning again," a spokesperson told The Register.
"For those already in a session, we recommend not refreshing the screen as this may disconnect your stream. All sessions are recorded and will be available to you on-demand shortly after the virtual event platform is live again."
Smartish vacuum maker iRobot is also reporting services on its app being affected.
"We have executed a mitigation which is showing significant recovery in the US-EAST-1 Region," Amazon said at 1404 PT (2204 UTC).
"We are continuing to closely monitor the health of the network devices and we expect to continue to make progress towards full recovery. We still do not have an ETA for full recovery at this time."
By 1635 PST (0035 UTC), things are looking a lot better, it seems, although users aren't out of the woods yet.
"With the network device issues resolved, we are now working towards recovery of any impaired services," AWS reports. "We will provide additional updates for impaired services within the appropriate entry in the Service Health Dashboard."
Update at 06:15 UTC on December 8th to add:
At the time of writing, the AWS Status Page records just one service as experiencing trouble: the Amazon Elastic Container Service housed in Northern Virginia, where the status entry says "task sizes smaller than 4vCPU are less likely to see insufficient capacity errors."
AWS has not explained what went wrong with its network devices.