AWS postmortem: Internal ops teams' own monitoring tools went down, had to comb through logs

OK, it wasn't DNS, but it was hanging around the scene looking shifty

Amazon has published additional information about last week's US-East-1 outage, revealing that its staffers had to pick their way through log files after the web giant's own monitoring tools were hit.

Amazon seems not to want to reveal much technical detail about its internal systems. That is somewhat understandable; quite likely, a few pundits would be horrified, a few others would scour it for hints for future attacks, and the rest of the world would neither understand nor care. Either way, it might put off a few customers, current or potential.


There is an internal AWS network, which hosts some unspecified internal services that are used to create and manage some unspecified internal AWS resources. Some other internal services are hosted on the main AWS network. Amazon doesn't tell the world much about this internal network, but it has multiple links to the outside world and the cloud goliath "scale[s] the capacity of this network significantly" to ensure its high availability. And that is the process that went wrong.

An automatic scaling tool of some kind kicked in to scale one of the internal services – one that runs on the main AWS network – and it went wrong, triggering "a large surge of connection activity."

Basically, it swamped the internal network, and this slowed the internal DNS into uselessness along with Amazon's internal monitoring tools. The poor operators were forced into relying on log files to trace the problem. This sounds appallingly twentieth-century for the harried sysadmins, which at least puts them a couple of centuries ahead of Amazon's warehouse workers.
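The report doesn't say how AWS clients now space out their connection attempts, but the standard defence against this kind of connection storm is retrying with exponential backoff plus random jitter, so thousands of clients don't hammer a struggling service in lockstep. A minimal sketch of that general pattern (the function name, parameters, and `ConnectionError` trigger are illustrative assumptions, not anything from the AWS report):

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.1, cap=2.0,
                      sleep=time.sleep):
    """Retry `operation`, spacing attempts with exponential backoff plus
    full jitter so many clients don't retry at the same moment."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            # Full jitter: random delay in [0, min(cap, base * 2**attempt)]
            delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
            sleep(delay)
```

The jitter is the important part: plain exponential backoff still synchronises retries into waves, while a random delay spreads them out.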

Although the report refrains from entirely blaming DNS, it seems that moving the internal DNS to another network, which took about two hours, gave the admins enough breathing room to work out what was wrong. It also points out that it was only the AWS internal management network that was overloaded into uselessness, not AWS itself.

Because internal DNS is foundational for all services and this traffic was believed to be contributing to the congestion, the teams focused on moving the internal DNS traffic away from the congested network paths. At 9:28 AM PST [on 7 December 2021], the team completed this work and DNS resolution errors fully recovered.
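What "moving the internal DNS traffic away from the congested network paths" looked like inside AWS isn't disclosed, but the general shape of the fix is familiar: give resolution more than one path, and fall through to the next endpoint when one stops answering. A toy sketch of that pattern, with the function names and addresses entirely hypothetical:

```python
def resolve_with_fallback(name, resolvers):
    """Try each resolver endpoint in order, returning the first answer.

    `resolvers` is a list of callables mapping a hostname to an address.
    In practice each would query a DNS endpoint reached over a distinct
    network path, so congestion on one path doesn't kill resolution.
    """
    last_error = None
    for resolve in resolvers:
        try:
            return resolve(name)
        except OSError as exc:  # timeout, refusal, unreachable path
            last_error = exc
    raise last_error or OSError(f"no resolvers configured for {name!r}")
```

The catch, as the outage showed, is that the fallback paths have to exist and be provisioned before the congestion event, not stood up two hours into it.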

As we noted last week, us-east-1 is the first and oldest of AWS' 21 regions, and a side-effect of that is that it's where the AWS global console landing page is hosted. As one Reg reader noted: "The AWS console is having problems… That's a major flaw IMHO, where if us-east-1 goes down then the console landing page disappears."

It is richly ironic that a service that allows its customers to spread their workloads around the world doesn't do the same with some of its own core services. Amazon noted in the outage report: "We have also deployed additional network configuration that protects potentially impacted networking devices even in the face of a similar congestion event. These remediations give us confidence that we will not see a recurrence of this issue."

The issue also took down the mega-corp's own Service Health Dashboard and Support Contact Center, and impaired the Amazon Connect service that it runs for customers.

It's a reminder that, as security man Brian Krebs' blog put it recently, "The Internet is Held Together With Spit & Baling Wire." ®
