Netflix reveals massive migration to new mix of microservices, asynchronous workflows and serverless functions

Goes deep on Docker and adopts ‘strangler fig’ pattern to replace legacy platform

Netflix has revealed it’s built a new media ingestion and distribution platform and expects to spend much of 2021 on a migration from what it describes as a “large and complicated legacy system”.

As detailed in the streaming firm’s tech blog, the new platform is called “Cosmos” and is the fourth generation of a tool that is used to “process incoming media files from our partners and studios to make them playable on all devices.”

The previous generation of the platform, called “Reloaded”, has been in use for seven years but was showing its age.

“When Reloaded was designed, we were a small team of developers operating a constrained compute cluster, and focused on one use case: the video/audio processing pipeline,” wrote Netflix senior software engineer Frank San Miguel. “As time passed the number of developers more than tripled, the breadth and depth of our use cases expanded, and our scale increased more than tenfold. The monolithic architecture significantly slowed down the delivery of new features.”

“We could no longer expect everyone to possess the specialized knowledge that was necessary to build and deploy new features. Dealing with production issues became an expensive chore that placed a tax on all developers because infrastructure code was all mixed up with application code. The centralized data model that had served us well when we were a small team became a liability.”

Netflix is famously all-in on cloud and microservices, two approaches often touted as paragons of agility. Yet the quotes above do read rather like an admission that the company nonetheless accrued technical debt and has a legacy system to replace.

San Miguel’s post describes Cosmos services as “not a microservice but there are similarities.”

His definition of a microservice is “an API with stateless business logic which is autoscaled based on request load … provides strong contracts with its peers while segregating application data and binary dependencies from other systems.”

“A Cosmos service retains the strong contracts and segregated data/dependencies of a microservice, but adds multi-step workflows and computationally intensive asynchronous serverless functions,” San Miguel wrote.

The developer said that in a typical Cosmos service “clients send requests to a Video encoder service API layer. A set of rules orchestrate workflow steps and a set of serverless functions power domain-specific algorithms. Functions are packaged as Docker images and bring their own media-specific binary dependencies (e.g. Debian packages). They are scaled based on queue size, and may run on tens of thousands of different containers. Requests may take hours or days to complete.”
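The post says functions are "scaled based on queue size" but this article does not reproduce the actual policy. As a rough illustration only, here is a minimal Python sketch of one way queue-driven scaling can work. The function name `desired_containers`, the backlog target of ten requests per container and the container limits are all assumptions made for the example, not details from Netflix.

```python
import math


def desired_containers(
    queue_size: int,
    target_backlog_per_container: int = 10,  # assumed target, not from the post
    min_containers: int = 0,                 # assumed floor
    max_containers: int = 10_000,            # assumed ceiling
) -> int:
    """Return how many containers to run for the current queue backlog.

    Hypothetical policy: keep roughly `target_backlog_per_container`
    queued requests per container, clamped to the configured limits.
    """
    if queue_size < 0:
        raise ValueError("queue_size must be non-negative")
    if target_backlog_per_container <= 0:
        raise ValueError("target_backlog_per_container must be positive")

    # Round up so a partial batch of work still gets a container.
    needed = math.ceil(queue_size / target_backlog_per_container)

    # Clamp the result between the floor and the ceiling.
    return max(min_containers, min(max_containers, needed))


if __name__ == "__main__":
    for backlog in (0, 25, 1_000_000):
        print(backlog, "->", desired_containers(backlog))
```

The point of scaling on backlog rather than request rate is that work which takes "hours or days to complete" piles up in a queue long before it shows up as load on an API.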

Cosmos has subsystems, among them one named Stratum, which the post describes as a serverless layer “for running stateless and computational-intensive functions.”

The post explains a fair bit about how Cosmos prioritises workloads and allocates resources, and how Netflix’s colossal AWS rig is configured to make sure developers and apps have the power they need when they need it.

San Miguel also reveals that Netflix started work on Cosmos in 2018, has used it in production since 2019 and now uses it for around 40 services. The company has chosen the “strangler fig” migration pattern, “which lets the new system grow around the old one and eventually replace it completely.”
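For readers unfamiliar with the strangler fig pattern, the usual approach is to put a routing layer in front of the old system and move one kind of work at a time to the new one. The Python sketch below shows that general idea under stated assumptions. The `StranglerFacade` class, the job types and both handler functions are hypothetical and say nothing about how Netflix actually routes work between Reloaded and Cosmos.

```python
from typing import Callable, Dict

# A handler takes a request payload and returns a result (illustrative type).
Handler = Callable[[str], str]


class StranglerFacade:
    """Route each job type to the new system once migrated, else to legacy."""

    def __init__(self, legacy: Handler) -> None:
        self._legacy = legacy
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, job_type: str, new_handler: Handler) -> None:
        # Mark one job type as served by the new system from now on.
        self._migrated[job_type] = new_handler

    def handle(self, job_type: str, payload: str) -> str:
        # Anything not yet migrated falls through to the legacy handler.
        handler = self._migrated.get(job_type, self._legacy)
        return handler(payload)


def legacy_system(payload: str) -> str:
    # Hypothetical stand-in for the old platform.
    return f"legacy:{payload}"


def new_encode(payload: str) -> str:
    # Hypothetical stand-in for a service on the new platform.
    return f"new:{payload}"


if __name__ == "__main__":
    facade = StranglerFacade(legacy_system)
    print(facade.handle("encode", "title-1"))   # served by legacy
    facade.migrate("encode", new_encode)
    print(facade.handle("encode", "title-1"))   # now served by the new system
    print(facade.handle("inspect", "title-2"))  # still served by legacy
```

Callers only ever talk to the facade, so each job type can be moved, or moved back, without them noticing. When nothing falls through to the legacy handler any more, the old system can be switched off.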

“2021 will be a big year for Cosmos as we move the majority of work from Reloaded into Cosmos, with more developers and much higher load,” San Miguel concludes. “We plan to evolve the programming model to accommodate new use cases. Our goals are to make Cosmos easier to use, more resilient, faster and more efficient.” ®

Biting the hand that feeds IT © 1998–2021