AWS postmortem: Internal ops teams' own monitoring tools went down, had to comb through logs

OK, it wasn't DNS, but it was hanging around the scene looking shifty


Amazon has published additional information about last week's US-East-1 outage, revealing that its staffers had to pick their way through log files when the web giant's own monitoring tools were hit.

Amazon seems not to want to reveal much technical detail about its internal systems. That is somewhat understandable: quite likely, a few pundits would be horrified, a few others would scour it for hints for a future attack, and the rest of the world would neither understand nor care. Either way, it might put off a few customers, current or potential.

There is an internal AWS network, which hosts some unspecified internal services that are used to create and manage some unspecified internal AWS resources. Some other internal services are hosted on the main AWS network. Amazon doesn't tell the world much about this internal network, but it has multiple links to the outside world and the cloud goliath "scale[s] the capacity of this network significantly" to ensure its high availability. And that is the process that went wrong.

An automatic scaling tool of some kind kicked in to scale one of the internal services – one that runs on the main AWS network – and it went wrong, triggering "a large surge of connection activity."

Basically, it swamped the internal network, and this slowed the internal DNS into uselessness along with Amazon's internal monitoring tools. The poor operators were forced into relying on log files to trace the problem. This sounds appallingly twentieth-century for the harried sysadmins, which at least puts them a couple of centuries ahead of Amazon's warehouse workers.
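
For a flavour of what that log-combing involves, here is a minimal sketch in Python. Amazon has not published its log formats or tooling, so everything here is an assumption made for illustration: the line layout, the component names `dns`, `scaler` and `monitor`, and the helper names `errors_per_minute` and `busiest_minutes` are all hypothetical. The idea is simply to bucket error lines by minute so the worst stretch stands out without a dashboard.

```python
import re
from collections import Counter

# Hypothetical log line layout, invented for this sketch:
#   2021-12-07T15:32:10Z ERROR dns: resolution failed host=internal.example
# The "ts" group deliberately stops at the minute so lines bucket per minute.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}Z\s+"
    r"(?P<level>[A-Z]+)\s+(?P<component>[\w-]+):"
)


def errors_per_minute(lines, component="dns"):
    """Count ERROR lines from one component, keyed by minute."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and m["level"] == "ERROR" and m["component"] == component:
            counts[m["ts"]] += 1
    return counts


def busiest_minutes(counts, top=3):
    """Return the minutes with the most errors, worst first."""
    return counts.most_common(top)


# Made-up sample data, not taken from any real AWS log.
SAMPLE = [
    "2021-12-07T15:32:10Z ERROR dns: resolution failed host=internal.example",
    "2021-12-07T15:32:41Z ERROR dns: resolution failed host=internal.example",
    "2021-12-07T15:32:55Z INFO scaler: added capacity",
    "2021-12-07T15:33:02Z ERROR dns: resolution timed out host=internal.example",
    "2021-12-07T15:33:05Z ERROR monitor: metrics push failed",
    "not a log line",
]

if __name__ == "__main__":
    for minute, count in busiest_minutes(errors_per_minute(SAMPLE)):
        print(minute, count)
```

Run against the made-up sample, this flags 15:32 as the worst minute, with two DNS errors. Lines that do not match the assumed layout are skipped rather than treated as failures, which matters when logs from many services are mixed together.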

Although the report refrains from entirely blaming DNS, it seems that moving the internal DNS to another network, which took about two hours, gave the admins enough breathing room to work out what was wrong. It also points out that it was only the AWS internal management network that was overloaded into uselessness, not AWS itself.

As the report puts it: "Because internal DNS is foundational for all services and this traffic was believed to be contributing to the congestion, the teams focused on moving the internal DNS traffic away from the congested network paths. At 9:28 AM PST [on 7 December 2021], the team completed this work and DNS resolution errors fully recovered."

As we noted last week, us-east-1 is the first and oldest of AWS' 21 regions, and a side-effect of that is that it's where the AWS global console landing page is hosted. As one Reg reader noted: "The AWS console is having problems… That's a major flaw IMHO, where if us-east-1 goes down then the console landing page disappears."

It is richly ironic that a service that allows its customers to spread their workloads around the world doesn't do the same with some of its own core services. Amazon noted in its outage report: "We have also deployed additional network configuration that protects potentially impacted networking devices even in the face of a similar congestion event. These remediations give us confidence that we will not see a recurrence of this issue."

The issue also took down the mega-corp's own Service Health Dashboard and Support Contact Center, and impaired the Amazon Connect service that it runs for customers.

It's a reminder that, as security man Brian Krebs' blog put it recently, "The Internet is Held Together With Spit & Baling Wire." ®

Biting the hand that feeds IT © 1998–2022