AWS postmortem: Internal ops teams' own monitoring tools went down, had to comb through logs

OK, it wasn't DNS, but it was hanging around the scene looking shifty

Amazon has published additional information about last week's US-East-1 outage, revealing that its staffers had to pick their way through log files after the web giant's own monitoring tools were hit.

Amazon seems not to want to reveal much technical detail about its internal systems. That is somewhat understandable; quite likely, a few pundits would be horrified, a few others would scour it for hints for future attacks, and the rest of the world would neither understand nor care. Either way, it might put off a few customers, current or potential.


There is an internal AWS network, which hosts some unspecified internal services that are used to create and manage some unspecified internal AWS resources. Some other internal services are hosted on the main AWS network. Amazon doesn't tell the world much about this internal network, but it has multiple links to the outside world and the cloud goliath "scale[s] the capacity of this network significantly" to ensure its high availability. And that is the process that went wrong.

An automatic scaling tool of some kind kicked in to scale one of the internal services – one that runs on the main AWS network – and it went wrong, triggering "a large surge of connection activity."

Basically, it swamped the internal network, and this slowed the internal DNS into uselessness along with Amazon's internal monitoring tools. The poor operators were forced into relying on log files to trace the problem. This sounds appallingly twentieth-century for the harried sysadmins, which at least puts them a couple of centuries ahead of Amazon's warehouse workers.
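The report doesn't say how AWS clients now space out their connection attempts, but the standard defence against this kind of connection storm is retrying with exponential backoff plus random jitter, so thousands of clients don't hammer a struggling service in lockstep. A minimal sketch of that general pattern (the function name, parameters, and `ConnectionError` trigger are illustrative assumptions, not anything from the AWS report):

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.1, cap=2.0,
                      sleep=time.sleep):
    """Retry `operation`, spacing attempts with exponential backoff plus
    full jitter so many clients don't retry at the same moment."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            # Full jitter: random delay in [0, min(cap, base * 2**attempt)]
            delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
            sleep(delay)
```

The jitter is the important part: plain exponential backoff still synchronises retries into waves, while a random delay spreads them out.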

Although the report refrains from entirely blaming DNS, it seems that moving the internal DNS to another network, which took about two hours, gave the admins enough breathing room to work out what was wrong. It also points out that it was only the AWS internal management network that was overloaded into uselessness, not AWS itself.

Because internal DNS is foundational for all services and this traffic was believed to be contributing to the congestion, the teams focused on moving the internal DNS traffic away from the congested network paths. At 9:28 AM PST [on 7 December 2021], the team completed this work and DNS resolution errors fully recovered.
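What "moving the internal DNS traffic away from the congested network paths" looked like inside AWS isn't disclosed, but the general shape of the fix is familiar: give resolution more than one path, and fall through to the next endpoint when one stops answering. A toy sketch of that pattern, with the function names and addresses entirely hypothetical:

```python
def resolve_with_fallback(name, resolvers):
    """Try each resolver endpoint in order, returning the first answer.

    `resolvers` is a list of callables mapping a hostname to an address.
    In practice each would query a DNS endpoint reached over a distinct
    network path, so congestion on one path doesn't kill resolution.
    """
    last_error = None
    for resolve in resolvers:
        try:
            return resolve(name)
        except OSError as exc:  # timeout, refusal, unreachable path
            last_error = exc
    raise last_error or OSError(f"no resolvers configured for {name!r}")
```

The catch, as the outage showed, is that the fallback paths have to exist and be provisioned before the congestion event, not stood up two hours into it.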

As we noted last week, us-east-1 is the first and oldest of AWS' 21 regions, and a side-effect of that is that it's where the AWS global console landing page is hosted. As one Reg reader noted: "The AWS console is having problems… That's a major flaw IMHO, where if us-east-1 goes down then the console landing page disappears."

It is richly ironic that a service that allows its customers to spread their workloads around the world doesn't do the same with some of its own core services. Amazon noted in the outage report: "We have also deployed additional network configuration that protects potentially impacted networking devices even in the face of a similar congestion event. These remediations give us confidence that we will not see a recurrence of this issue."

The issue also took down the mega-corp's own Service Health Dashboard and Support Contact Center, and impaired the Amazon Connect service that it runs for customers.

It's a reminder that, as security man Brian Krebs' blog put it recently, "The Internet is Held Together With Spit & Baling Wire." ®
