Amazon’s Away Teams laid bare: How AWS's hivemind of engineers develop and maintain their internal tech

Cloud giant's structure, staff practices revealed


Deep dive: Companies inside and out of Silicon Valley have found their own ways to rapidly develop and deploy features and functionality.

Within the belly of Amazon Web Services, the web giant's gigantic cloud beast, though, is a specific digestive system – a concept called Away Teams – that accepts certain weaknesses to achieve maximum velocity.

El Reg has spent a few months talking to about a dozen people who have lived inside this particular process, and now it's time to share it with you here. Our sources will remain anonymous as they are not authorized to speak publicly about Amazon. Official spokespeople for the US giant declined to comment on our findings.

Capturing the way things are at an organization as large as Amazon is always a challenge. The company has never publicly codified its management system as it has done for its leadership principles. But this picture might offer new ideas for people seeking to coordinate technology development at scale.

The problem at hand

Once your engineers and technical staffers number in the hundreds or thousands, the organization outgrows everything that works at the team level. When the whole mess is in production, some way must be found so those 20, 50, or 100 teams can get help from each other.

Agile, Scrum, and DevOps methods keep a specific project humming and evolving from conception to delivery, but they won't keep the work of a score of teams coordinated.

Creating a coherent design for a platform or application, of course, is a fundamental problem, and so is organizing the projects to implement such a design. But no matter how well you do at first, adjustments are needed.

Every one of those teams was set up to achieve certain objectives. Maybe they have an individual profit and loss (P&L), or Objectives and Key Results (the famous OKRs that Google adopted, inspired by Intel's use of them). But in a modern platform, almost all services that comprise the whole will use each other.

When someone shows up at your cube and asks for a new feature in the service you are offering or to fix a bug or to optimize performance, what do you do? Do you let them have access to your source code? If a new feature is popular with users or customers, do you keep it for your team or give it to the team where it may more naturally belong? If your team could add a capability that would help other teams make more money, should you do that before what is on your approved roadmap?

Anyone who thinks such issues are easily resolved and that everyone will just do the right thing has never worked inside a large organization in the real world.

Of course, good management should intervene to help teams work together. But seeking management attention slows things down. And, surprise, surprise: management doesn’t always make the right decision.

Amazon's system for internal collaboration

Amazon has faced these issues since its inception and has created a system based on the principles of service-oriented architecture (with some significant additions to codify the management innovations that have made Internet companies so successful).

Andrew Ng, the Stanford researcher, entrepreneur, and AI expert, in a talk at a San Francisco AI conference in 2017, explained that a real internet company was not a shopping mall with a website, but a company that embraced a short cycle time, A/B testing, and pushed down decision making.

Amazon is not re-inventing the wheel here – it's looking at a problem faced by a large number of firms – but it does seem to have found an interesting way to solve the problem. It has a system of optimizing internal collaboration by organizing development around a collection of independently managed services, with a fascinating set of policies for governing it all based on A/B testing, pushed-down decision making, and a carefully curated culture of collaboration that makes use of a novel concept: Away Teams.

As it turns out, Amazon’s system, especially the Away Teams, aligns with the ideas of technology thinkers, such as Ray Kurzweil’s explanation of the exponential progress of technology and MIT professor Eric von Hippel’s observations about the power of user-driven innovation.

From the Yegge rant to service-oriented collaboration

From what we know of his behavior, Amazon CEO Jeff Bezos is a huge fan of forcing functions, which, from a CEO perspective, are dictates from on high that mandate certain types of change.

Bezos uses his personal magnetism, the aura of his success, and his power as CEO to force the company to transform itself. Forcing Amazon.com to eat its own dogfood and use AWS was one such endeavor. The drive to move Amazon completely off Oracle is another, although the author of that may be Andy Jassy, head of AWS. But our favorite is the move toward service-oriented architecture, recounted in what became known as the Yegge Rant.

As told by Steve Yegge, an engineer who had moved to Google after several years at Amazon, around 2002 Bezos demanded that everyone at Amazon make their department’s offering available as services exposed through APIs. Yegge's post (on the now-defunct Google+) explains that this forcing function caused an ocean of pain as the company learned to address technical and operational issues: debugging a service-oriented architecture, maintaining adequate performance when every internal user may be an unwitting DoS attacker that spikes traffic, handling operational support, discovering what services were available, and lots of other stuff. We should note that Yegge was quickly contrite about the posting.
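To make the "unwitting DoS attacker" problem concrete, here is a minimal sketch of one common defence: a per-caller token bucket that caps how fast any single internal team can hit a service. Yegge's post names the problem, not this remedy, so the technique is swapped in for illustration, and every name here (TokenBucket, CallerThrottle, the team labels) is hypothetical rather than anything Amazon is known to use.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    """Allows bursts of up to `capacity` calls, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now: Optional[float] = None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so a new caller can burst
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Spend one token if available; `now` can be injected for testing."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the time that has passed, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class CallerThrottle:
    """Keeps one bucket per calling team so a noisy caller cannot starve the rest."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, caller: str, now: Optional[float] = None) -> bool:
        bucket = self.buckets.get(caller)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[caller] = bucket
        return bucket.allow(now)


if __name__ == "__main__":
    throttle = CallerThrottle(rate=1.0, capacity=2.0)
    for attempt in range(3):
        # Hypothetical caller name; the third rapid call should be rejected.
        print(attempt, throttle.allow("team-checkout", now=0.0))
```

The point of keying buckets by caller is that one team's traffic spike is rejected at the door while everyone else's requests carry on as normal.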

The forcing function worked as planned, however, and created a technology culture around services that had some interesting principles. One such principle, which we have not been able to get multiple sources to verify, is the policy that once a team is the only remaining user of an API, it becomes the owner of that service, even if it didn’t initially develop it.
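The last-user ownership rule is simple enough to sketch in a few lines. This is an illustration of the policy as our sources described it, not a depiction of any real Amazon tooling: the Service record, the resolve_owner function, and the team and service names are all hypothetical, and since the rule itself is unverified, treat the sketch the same way.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Service:
    """Hypothetical record of an internal service and the teams that call it."""
    name: str
    owner: str
    consumers: Set[str] = field(default_factory=set)


def resolve_owner(service: Service) -> str:
    """Return the owning team under the last-user rule.

    If exactly one team still calls the API, that team becomes the owner,
    even if it did not build the service. Otherwise ownership is unchanged.
    """
    if len(service.consumers) == 1:
        (sole_user,) = service.consumers
        return sole_user
    return service.owner


if __name__ == "__main__":
    pricing = Service(
        name="pricing-api",
        owner="team-catalog",
        consumers={"team-checkout", "team-ads"},
    )
    print(resolve_owner(pricing))  # two consumers: still team-catalog

    pricing.consumers.discard("team-ads")
    print(resolve_owner(pricing))  # one consumer left: team-checkout
```

Note what the sketch leaves open, as the reported policy does: what happens when a service has no consumers at all. Here it simply stays with its current owner.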

But alone, technology, tools, and operations for a mature service-oriented architecture don’t solve the problem of internal collaboration. Here’s where Amazon broke new ground, especially with the concept of the Away Team. The Register hasn’t heard that Amazon has a name for this system, but service-oriented collaboration seems apt.

Biting the hand that feeds IT © 1998–2022