Microsoft snubs Service Fabric as it plots to switch Teams infrastructure to Kubernetes
Plus, a new Detonation Service and other explosive revelations about easing capacity constraints in lockdown
Microsoft's CTO for Azure has opened up on both the company's response to scaling issues with Teams during the COVID-19 pandemic and future plans to switch to "container-based deployments using Azure Kubernetes Service".
The pandemic put pressure on Microsoft's cloud capacity, and chief techie Mark Russinovich describes in a written post and video how demand for services including Teams, Windows Virtual Desktop and Xbox surged as people endured lockdown – no doubt accounting for issues reported in the UK and elsewhere.
Russinovich notes that Teams daily active users expanded from 32 million earlier this year to 75 million in April. Use of the Windows Virtual Desktop service tripled in four weeks – though as a relatively new service, this will have been from a modest base. Xbox gaming saw a 50 per cent multiplayer increase, a 30 per cent increase in daily peak volumes, and a 50 per cent increase in daily new accounts.
Microsoft has put its energy into scaling Azure to meet the demand.
Building new data centres, or even provisioning new servers in existing ones, takes too long, so Microsoft took a number of other steps to increase capacity. Some of the measures were about gaining flexibility to spread the load, such as deploying critical microservices to more regions. The team also found that "by redeploying some of our microservices to favour a larger number of smaller compute clusters, we were able to avoid some per-cluster scaling considerations."
There were also some "purposeful degradations," as Russinovich called them: killswitches for non-essential features. Microsoft turned off the Teams typing indicator – little animated dots that tell you someone is typing – and removed a read-receipt animation, saving a remarkable 30 per cent core CPU time during peak load.
Users annoyed by these animations may wonder why they exist, if they are so expensive. Another optimisation was to stop the mobile Teams app from automatically retrieving next week’s calendar, "which reduced request volume by 80 per cent," he said.
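For readers wondering what a killswitch of this kind looks like in practice, here is a minimal sketch in Python. It is not Microsoft's implementation: the `KillSwitches` class, the `typing_indicator` feature name and the `handle_keystroke` function are all hypothetical, invented purely to illustrate the idea of skipping non-essential work when an operator flips a flag.

```python
from dataclasses import dataclass, field


@dataclass
class KillSwitches:
    """Hypothetical registry of features that operators can disable under load."""
    disabled: set = field(default_factory=set)

    def turn_off(self, feature: str) -> None:
        self.disabled.add(feature)

    def is_enabled(self, feature: str) -> bool:
        return feature not in self.disabled


def handle_keystroke(switches: KillSwitches, outbox: list, user: str) -> None:
    """Queue a typing notification only while the non-essential feature is on."""
    if switches.is_enabled("typing_indicator"):
        outbox.append(f"{user} is typing")


switches = KillSwitches()
outbox = []
handle_keystroke(switches, outbox, "alice")  # queued: the feature is still on
switches.turn_off("typing_indicator")        # operator flips the switch at peak load
handle_keystroke(switches, outbox, "bob")    # skipped: no notification is generated
print(outbox)
```

The point of the pattern is that the expensive path is never entered once the switch is off, so the saving shows up immediately without a redeployment.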
Some Xbox services were moved out of under-pressure regions (like Dublin) to locations such as US East, freeing capacity where it was most needed.
Russinovich discussed the architecture of Teams, which he described as microservice-based, though since it includes monsters like Exchange and SharePoint, the "microservice" concept is getting stretched in some parts of the product. His diagram shows two things at the bottom of the stack: virtual machines, and Service Fabric. Service Fabric is Microsoft's home-grown microservice platform and one of the core services in Azure.
According to Russinovich, the company is now planning to transition to containers and Kubernetes, the container orchestration platform that originated at Google. Irrespective of the merits of Service Fabric versus Kubernetes, the move will, as he notes, "align us with the industry," which has chosen Kubernetes as the de facto standard.
The decision, he added, is also expected to "reduce our operating costs" and "improve our agility".
Microsoft also intends to "minimize the use of REST and favour more efficient binary protocols such as gRPC." Like Kubernetes, gRPC came out of Google. If you consider these Azure moves alongside the shift to using the Chromium browser engine in Edge, there is now a lot of Google-originated technology at Microsoft.
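To give a rough sense of why a binary protocol can be more efficient than JSON over REST, the sketch below encodes the same small message both ways. It is an illustration only: it does not use gRPC or Protocol Buffers (the serialisation format gRPC normally uses), but stands in Python's standard `struct` module for a compact binary layout. The message fields and their sizes are assumptions made up for the example.

```python
import json
import struct

# Hypothetical presence update: user id, status code, Unix timestamp.
message = {"user_id": 123456, "status": 2, "timestamp": 1592000000}

# Text encoding, roughly what a JSON-over-REST API would put on the wire.
as_json = json.dumps(message).encode("utf-8")

# Assumed fixed binary layout: unsigned 32-bit id, unsigned 8-bit status,
# unsigned 64-bit timestamp. "<" means little-endian with no padding.
layout = struct.Struct("<IBQ")
as_binary = layout.pack(message["user_id"], message["status"], message["timestamp"])

# Round-trip the binary form to confirm that nothing was lost.
user_id, status, timestamp = layout.unpack(as_binary)
decoded = {"user_id": user_id, "status": status, "timestamp": timestamp}

print(len(as_json), len(as_binary), decoded == message)
```

The binary form here is 13 bytes, several times smaller than the JSON text, because field names are not repeated in every message and numbers are not spelled out as digits. Real gRPC traffic adds its own framing and runs over HTTP/2, so the ratio in production will differ.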
Getting Teams scaling nicely on Kubernetes sounds challenging, but note that Microsoft is also injecting a little extra chaos: "systematically embracing chaos engineering practices to ensure all those mechanisms we put in place to make our system reliable are always fully functional," said Russinovich.
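In miniature, a chaos experiment means breaking a dependency on purpose and checking that the safety net holds. The sketch below is a toy version under stated assumptions: `FlakyDependency`, `fetch_with_fallback` and the 50 per cent failure rate are hypothetical and say nothing about the tooling Microsoft actually uses.

```python
import random


class FlakyDependency:
    """Hypothetical downstream service with a fault injector built in."""

    def __init__(self, failure_rate: float, rng: random.Random):
        self.failure_rate = failure_rate
        self.rng = rng

    def fetch(self) -> str:
        # Chaos hook: deliberately fail a fraction of calls.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return "fresh"


def fetch_with_fallback(dep: FlakyDependency, retries: int = 2) -> str:
    """The reliability mechanism under test: retry, then serve a cached value."""
    for _ in range(retries + 1):
        try:
            return dep.fetch()
        except ConnectionError:
            continue
    return "cached"


rng = random.Random(42)  # seeded so the experiment is repeatable
dep = FlakyDependency(failure_rate=0.5, rng=rng)
results = [fetch_with_fallback(dep) for _ in range(1000)]

# The experiment passes if no injected fault ever reaches the caller.
print(all(r in ("fresh", "cached") for r in results))
```

If someone later removes the retry loop or the cached fallback, the injected `ConnectionError` escapes and the experiment fails loudly, which is the whole idea.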
Finally, the CTO for Azure introduced the Detonation service: an anti-malware service that works on links, files and attachments, copying them to a sandboxed "Detonation VM" where they are activated (opened or run) in order to inspect the outcome. ®