Inside Microsoft's Autopilot: Nadella's secret cloud weapon
Redmond man spills the beans on Microsoft's top-secret software
Juggler, puppeteer, plate-spinner, watchdog
Scheduling involves juggling different applications so as to provide guaranteed performance for tier-one applications – Azure workloads from paying customers, for example – while "compressing" lower-priority workloads – batch processing jobs for internal Microsoft projects, for example – to create capacity.
"If you think about an operating system on a computer, you're doing preemptive scheduling – running multiple apps and timeslicing into the environment," Neil says. "In this we're working through the bin packing problem – it's a very classic problem, no easy answer to it – an NP-hard issue."
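To make the bin packing problem Neil mentions concrete, here is a minimal sketch of first-fit decreasing, a textbook heuristic for it. This is not Autopilot's scheduler, whose details Microsoft did not disclose; the function name, the single-resource model, and the sample numbers are all assumptions made for illustration.

```python
from typing import List


def first_fit_decreasing(demands: List[int], capacity: int) -> List[List[int]]:
    """Place workloads on machines using the first-fit decreasing heuristic.

    Hypothetical model: each workload needs one number of resource units
    and every machine offers the same `capacity`. The result is a valid
    packing, not necessarily the optimal one, since bin packing is NP-hard.
    """
    bins: List[List[int]] = []  # workloads assigned to each machine
    free: List[int] = []        # remaining capacity on each machine

    # Largest workloads first: they are the hardest to place later on.
    for demand in sorted(demands, reverse=True):
        if demand > capacity:
            raise ValueError(f"demand {demand} exceeds capacity {capacity}")
        for i, room in enumerate(free):
            if demand <= room:
                # First machine with enough room wins.
                bins[i].append(demand)
                free[i] -= demand
                break
        else:
            # No existing machine fits, so open a new one.
            bins.append([demand])
            free.append(capacity - demand)
    return bins


# Illustrative numbers only: six workloads, machines with 10 units each.
placement = first_fit_decreasing([4, 8, 1, 4, 2, 1], capacity=10)
print(placement)
```

A production scheduler has to weigh several resources at once, plus priorities and preemption, which is part of why there is "no easy answer" to the problem.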
Neil wasn't able to give further information on the precise characteristics of Autopilot's scheduler, but a recent academic paper by Microsoft Research indicates the company is planning to introduce a way to more efficiently schedule compressible workloads in an automated manner. (There's also evidence that Microsoft's internal multi-exabyte "COSMOS" store uses a scripting language called "SCOPE" for analytics-specific scheduling.)
This scheduling component means that Autopilot, along with being a puppeteer, is also a plate-spinner.
Autopilot: the first software the servers in this ITPAC will meet when they arrive at a Microsoft data center
And just like systems in use at Google (Borg and its successor Omega) and Twitter (Mesos), Autopilot's complexity makes it behave more like a skilled yet uncommunicative colleague than a subservient system.
"The thing you have to get comfortable with is you're relinquishing a lot of control to this system and allowing it to do the right thing for you, and trusting it – it may take steps you don't know about," Neil says. "These systems are so large that no one person is keeping track. That's what the system is designed to do – take care of the details."
Autopilot also gathers large amounts of data to help Microsoft analyze its own infrastructure and identify problems.
"We have all the information about processor loading, memory loading," explains Neil. "A common thing that people don't sort of grok is that you have a physical machine and it has a set of capabilities, and it's really the first one you run out of that's important. You might have an application that runs out of memory first, so understanding that allows us to optimize for choke points."
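Neil's point about the first resource to run out can be shown with a little arithmetic. The sketch below is an assumption-laden illustration, not Microsoft's tooling: the resource names, the machine capacities, and the per-instance requirements are all invented.

```python
from typing import Dict, Tuple


def first_exhausted(capacity: Dict[str, int],
                    per_instance: Dict[str, int]) -> Tuple[str, int]:
    """Return the resource that runs out first and how many instances fit.

    For each resource, count how many whole copies of the application the
    machine could hold if that resource were the only constraint. The
    smallest count marks the choke point.
    """
    fits = {
        name: capacity[name] // need
        for name, need in per_instance.items()
        if need > 0
    }
    if not fits:
        raise ValueError("application declares no resource requirements")
    bottleneck = min(fits, key=fits.get)
    return bottleneck, fits[bottleneck]


# Hypothetical machine and application profile.
machine = {"cpu_cores": 32, "memory_gb": 128, "disk_gb": 4000}
app = {"cpu_cores": 2, "memory_gb": 12, "disk_gb": 100}

choke_point, max_instances = first_exhausted(machine, app)
print(choke_point, max_instances)
```

In this made-up case the machine has CPU for 16 instances and disk for 40, but memory for only 10, so memory is the constraint worth optimizing around.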
Though the service includes things such as usage metrics around CPU, memory, network, disk, and so on, Neil says "we have learned that having an end-to-end test path that is continuously monitored gives a much better result. So as an example, we can do a search query, verify we get a valid result, and look at the latency of that result to see if it is within our expected bounds. We call these watchdogs. These can trigger automated remediation or cause us to roll back a partial deployment to a previous version."
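The watchdog pattern Neil describes (issue a request, check the answer is valid, check the latency is within bounds, react if not) can be sketched as follows. Every name here is hypothetical, and the probe and failure handler are stand-ins; the article does not describe how Autopilot's watchdogs are actually built.

```python
import time
from typing import Callable, List, Optional


def run_watchdog(probe: Callable[[], Optional[str]],
                 max_latency_s: float,
                 on_failure: Callable[[str], None]) -> bool:
    """Run one end-to-end check and report whether the service looks healthy.

    `probe` stands in for a real request such as a search query. It should
    return a non-empty result on success. `on_failure` stands in for
    whatever remediation or rollback hook the operator wires up.
    """
    start = time.monotonic()
    try:
        result = probe()
    except Exception as exc:
        on_failure(f"probe raised {exc!r}")
        return False
    latency = time.monotonic() - start

    if not result:
        on_failure("probe returned no valid result")
        return False
    if latency > max_latency_s:
        on_failure(f"latency {latency:.3f}s exceeded {max_latency_s:.3f}s")
        return False
    return True


# Stand-in probes: one healthy, one returning nothing.
alerts: List[str] = []
healthy = run_watchdog(lambda: "10 results", 1.0, alerts.append)
broken = run_watchdog(lambda: None, 1.0, alerts.append)
print(healthy, broken, alerts)
```

In practice a check like this would run continuously on a schedule, and the failure hook would decide between automated remediation and rolling back a partial deployment.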
In this way, it apparently differs from Google's systems, which are thought to gather more detailed metrics via an advanced technology named CPI² that lets Google isolate performance issues on single tasks running on single processors and selectively throttle them.
The power Autopilot gives Microsoft is vast, as it helps increase the efficiency with which the company harnesses its billions of dollars' worth of computers. As Microsoft shifts to being a "devices and services" company under cloud-expert Nadella, the importance of Autopilot will only grow as Redmond seeks to lash more of its digital universe together. With Autopilot, Neil thinks Microsoft has "the operating system for this new cloud world."
We're sure Nadella hopes the same. ®