Oh no, here we go again, groans the internet as AWS runs into IT problems. Briefly this time
If you're wondering why you couldn't access a website, app, or service for about 30 minutes today, this may be it
Amazon Web Services gave everyone a scare today as it once again suffered a partial IT breakdown, briefly taking down a chunk of the web with it. If you found you were unable to use your favorite website or app for a moment today, this may have been why.
Many feared another full-on AWS outage, like the one we saw earlier this month, was about to kick off. The biz finally admitted on its status page at 0748 PT (1548 UTC) that its US-West-2 region was experiencing connectivity problems, and similarly for US-West-1 at 0752 PT (1552 UTC).
Ten minutes later, it said it had worked out the root cause of the loss of connectivity to the regions, had made some fixes, and was seeing some recovery. And then at 0810 PT (1610 UTC), it declared:
We have resolved the issue affecting Internet connectivity to the US-WEST-1 Region. Connectivity within the region was not affected by this event. The issue has been resolved and the service is operating normally.
The same went for US-West-2 four minutes later. The total outage time was about 30 minutes. The above statement suggests that connections in and out of the region with the rest of the world were affected, and networking within the region was OK.
The exact cause was not spelled out. Perhaps a careless tech tripped over a cable, a backbone ISP somewhere had problems, or it was DNS. It is, after all, always DNS.
The effects of the downtime rippled through the internet in pretty much the same way as the US-East-1 region borkage at the start of this month: people noticing websites and apps hosted by Amazon no longer working as expected. AWS did not immediately respond to our queries regarding today's event.
The web giant's status page became increasingly unresponsive as either (a) netizens flocked to it to find out what had happened to their services or (b) things at AWS became increasingly borked.
It's tough timing for the cloud colossus, which has been hard at work over the past week patching its components affected by the Apache Log4j remote-code execution vulnerability (CVE-2021-44228), judging by Amazon's latest security bulletin on the matter.
AWS falling over, however briefly, is a reminder of just how reliant today's apps, websites, and services are on singular platforms like, well, AWS.
Amazon-owned video-streaming darling Twitch also broke down during the AWS connectivity blip. A glance at outage-spotting site Downdetector showed a variety of well-known services – some of which aren't hosted by AWS – experienced issues at the same time as Amazon, including Zoom, Salesforce, Facebook, and Slack. That suggests to us there may have been some kind of underlying infrastructure issue.
Twitter, however, appeared to remain mostly upright. Thank goodness for that. ®
Updated to add:
Amazon has been in touch to say: "Between 0714 PST and 0759 PST, customers experienced elevated network packet loss that impacted connectivity to a subset of internet destinations.
"Traffic within AWS Regions, between AWS Regions, and to other destinations on the internet was not impacted. The issue was caused by network congestion between parts of the AWS Backbone and a subset of Internet Service Providers, which was triggered by AWS traffic engineering, executed in response to congestion outside of our network.
"This traffic engineering incorrectly moved more traffic than expected to parts of the AWS backbone that affected connectivity to a subset of Internet destinations. The issue has been resolved, and we do not expect a recurrence."