How does Microsoft mitigate the risk of speculative-execution bugs on its Azure platform? The US goliath is unwilling to comment, despite running a session at its Ignite conference last month on exactly this subject.
The Ignite session itself was titled "Spectre/Meltdown: An Azure retrospective" and talked about "how the computer industry came together to address this new class of vulnerability, and specifically how Azure responded".
Spectre and Meltdown were the first examples of data-leaking side-channel flaws involving speculative execution, and blight CPU cores designed by Intel, AMD, Arm, and others, to varying degrees. Speculative execution is an optimisation technique where the processor executes some likely software instructions in advance, and discards the result if it is not needed. During this process, the CPU will access its caches, or otherwise touch resources on the system, in a way that allows an eavesdropper to gradually discern the contents of memory or registers. It's ingenious stuff, but slow and tricky to exploit in real life.
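To make the mechanism concrete, here is a deliberately simplified toy model in Python. It is not an exploit and measures no real timing: a plain set stands in for the CPU cache, and the names `ToyCache`, `victim_speculates` and `attacker_recovers_byte` are hypothetical, invented for illustration. It shows only the core idea that a discarded speculative access can leave a trace in the cache that an observer can later detect.

```python
# Toy model only: a Python set stands in for the CPU cache.
# Nothing here measures real timing or reads real memory.

class ToyCache:
    def __init__(self):
        self.lines = set()

    def flush(self):
        # Attacker evicts every probe line before the victim runs
        self.lines.clear()

    def touch(self, line):
        # Any access, speculative or not, pulls the line into the cache
        self.lines.add(line)

    def is_fast(self, line):
        # On real hardware a cached line loads measurably faster
        return line in self.lines


def victim_speculates(cache, secret_byte):
    # The speculated load's result is discarded by the CPU,
    # but the cache line it touched stays cached.
    cache.touch(secret_byte)


def attacker_recovers_byte(cache, secret_byte):
    cache.flush()
    victim_speculates(cache, secret_byte)
    # Probe all 256 possible byte values and see which one is "fast"
    for guess in range(256):
        if cache.is_fast(guess):
            return guess
    return None


def leak(secret: bytes) -> bytes:
    cache = ToyCache()
    return bytes(attacker_recovers_byte(cache, b) for b in secret)
```

In this model one byte is recovered per flush-speculate-probe round, which hints at why real attacks are slow: each round leaks only a little, and real timing measurements are noisy.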
The risk, though, is that malicious code or rogue logged-in users can potentially access sensitive data belonging to other applications and users across isolation boundaries, even between virtual machines, or between guest machines and the host – a possibility that is particularly alarming for cloud providers.
The session was unusual. The main part was a video describing the build-up to the Spectre and Meltdown reveal in January 2018. The specific problem was discovered by Google's Project Zero in June 2017, but was kept under embargo for six months. Microsoft was among those companies in the know, furiously patching Windows and its Azure platform, before the embargo on disclosure lifted on 10 January 2018. Open-source systems like Linux are patched in the open, though, and changes to the kernel, along with industry sources, tipped off The Register.
The Project Zero team announced at 9.30am the following day that it would disclose the vulnerabilities at 3pm, according to the video, causing Microsoft to assemble a crisis meeting. The Azure patching schedule to address the vulnerability had seven days to run, but now there were only four hours to go. "I don't know the proper collective noun for executives, but we had all them in a conference room," said the narrator. Dramatic music played in the background while Ignite attendees learned how the meeting went silent as Azure EVP Scott Guthrie spoke. "The security of our customers is paramount. Accelerate the rollout."
In a curious blend of technical information and marketing, Microsoft's intention seems to have been to assure attendees of its focus on security. Corporate VP Julia White is heard to say: "We could never ever put our customers at risk and if we broke that promise, why would anyone trust us ever again? It was this moment that to me proved it."
There was also some insight into Microsoft's approach, patching aside. Chief technology officer Mark Russinovich said: "In Azure we've adopted a mindset of assumed breach, which means we assume that hackers are going to get into the infrastructure, that potentially there might even be malicious insiders in Azure."
However, the video had a sting in the tail. Towards the end we heard: "Unfortunately we weren't done. Shortly after we had released this particular set of updates, we were contacted by another researcher who had discovered another issue of very similar style. And then a few days later another researcher, and then another. It turns out that Spectre and Meltdown were just the very beginning of an entire new class of issues."
This is the point hammered home by kernel maintainer Greg Kroah-Hartman, who said in October (more than two years after Spectre and Meltdown were discovered): "These problems are going to be with us for a long time; they're not going away." He also offered a partial solution, for anyone not running in a secure environment where all users are trusted. "Disable hyper-threading. That's the only way you can solve some of these issues. We are slowing down your workloads. Sorry."
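For readers who want to see where their own Linux machine stands, recent kernels report hyper-threading (SMT) state and per-vulnerability mitigation status through sysfs. The following is a minimal sketch, assuming a kernel new enough to provide `/sys/devices/system/cpu/smt/active` and the `/sys/devices/system/cpu/vulnerabilities/` directory; the helper names are our own, and on systems without those files the functions simply return `None` or an empty result.

```python
from pathlib import Path

# Assumption: standard Linux sysfs location; absent on other systems
SYSFS_CPU = Path("/sys/devices/system/cpu")


def read_text(path):
    """Return the stripped file contents, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def smt_active(base=SYSFS_CPU):
    """True/False if the kernel reports SMT state, None if unknown."""
    value = read_text(Path(base) / "smt" / "active")
    if value is None:
        return None
    return value == "1"


def vulnerability_report(base=SYSFS_CPU):
    """Map each kernel-reported vulnerability name to its status line."""
    vuln_dir = Path(base) / "vulnerabilities"
    if not vuln_dir.is_dir():
        return {}
    report = {}
    for entry in sorted(vuln_dir.iterdir()):
        text = read_text(entry)
        if text is not None:
            report[entry.name] = text
    return report


if __name__ == "__main__":
    print("SMT active:", smt_active())
    for name, status in vulnerability_report().items():
        print(f"{name}: {status}")
```

Several of the status lines mention SMT explicitly, which is the kernel's own way of flagging the trade-off Kroah-Hartman describes.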
The downside of disabling hyper-threading is a performance penalty of 20 per cent or more. Following the Ignite session video, a panel took questions so we asked the obvious one. Does Microsoft disable hyper-threading on Azure to protect its customers, as Kroah-Hartman recommends?
Our question was answered by Igal Figlin, partner program director of Azure Compute. When he said "Thank you for your question," the room burst into sympathetic laughter. "We do have, of course, our own hypervisor kernel and we do have extra measures beyond completely disabling hyper-threading. I understand the general notion of saying that, it is like, if you don't go out of the house, you won't be run over by a car. In a realistic world, not if we mitigate the threats, we need to continue to stay vigilant about potential attack vectors and mitigate attack vectors. Our stance is that every known attack vector we see, and this even includes attacks that we were not able to implement but it seems that it might be an attack vector, are being mitigated at every point in time. I don't think we need to go to the extreme measures of disabling functionality."
Figlin added a reference to the "big data that we can analyse" as Microsoft studies attacks on its infrastructure, which is helping the company "to make Azure safe and to make Windows hypervisor safe and also to contribute back to Linux. We should be sufficiently safe, and when it becomes not, this is why we have all sorts of measures and processes."
A fair answer, but it would be good to know more. We asked AWS about the same issue and the director of kernel and operating systems answered in detail. Google has published detailed information about mitigation for a number of its services, including Kubernetes Engine, where it says: "The host infrastructure that runs Kubernetes Engine isolates customer workloads from each other. Unless you are running untrusted code inside your own multi-tenant GKE clusters, you are not impacted." It offers similar information about Compute Engine.
At this time, Microsoft will not be commenting
The issue is complex. One way a cloud provider can protect customers, for example, is by ensuring that concurrent workloads from different customers do not run on the same CPU core. In the case of "serverless" services like AWS Lambda, this could be expensive. "We've been very careful to ensure that we never split cores between customers," AWS Elastic Compute Cloud VP Dave Brown told us. "We've never put workloads from different customers on the same VM." For serverless, "running functions from different customers within the same VM and just using processor isolation is absolutely not good enough." Expensive if a customer is making light use of a service? "It absolutely becomes very expensive." This was a factor behind the development of Firecracker, which can boot lightweight VMs quickly.
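The cost of never splitting a core between customers can be illustrated with a small sketch. This is our own toy model, not AWS's scheduler: it assumes two hardware threads per physical core, and the function names and tenant labels are hypothetical. It compares how many cores are needed when each customer's virtual CPUs are rounded up to whole cores against dense packing that ignores who owns what.

```python
# Assumption: two hardware threads (hyper-threads) per physical core
THREADS_PER_CORE = 2


def assign_core_exclusive(demands):
    """Give each tenant whole cores; no core ever hosts two tenants.

    demands maps a tenant name to the number of vCPUs it needs.
    Returns a list of (tenant, threads_used) pairs, one per core.
    """
    cores = []
    for tenant, vcpus in demands.items():
        remaining = vcpus
        while remaining > 0:
            used = min(THREADS_PER_CORE, remaining)
            cores.append((tenant, used))
            remaining -= used
    return cores


def cores_if_shared(demands):
    """Cores needed if vCPUs are packed densely, regardless of tenant."""
    total = sum(demands.values())
    return -(-total // THREADS_PER_CORE)  # ceiling division


if __name__ == "__main__":
    # Hypothetical light-use tenants, as in a serverless workload
    demands = {"a": 1, "b": 1, "c": 1, "d": 3}
    exclusive = assign_core_exclusive(demands)
    print("core-exclusive:", len(exclusive), "cores")
    print("densely shared:", cores_if_shared(demands), "cores")
```

With many customers each using a single thread, the exclusive policy approaches twice the core count of dense packing in this model, which is the expense Brown alludes to for lightly used services.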
It is probable Microsoft has an equally strong story to tell for Azure, but when we asked for more details, the answer was: "At this time, Microsoft will not be commenting." Could its rationale be that when it comes to security, information might help attackers? Further, while most Ignite sessions are now available to watch on video, the session we attended has not, as far as we know, been published. That said, the company does publish a detailed document about mitigating risk on Windows.
Speculative-execution bugs are real and there are proof-of-concept demonstrations, but to date there have been few reports of successful and damaging attacks. At the same time, in an age of cyberwarfare it is hard to believe that interested parties are not taking careful note, especially since the risk from these vulnerabilities is information theft. It is a difficult problem, and one that the cloud giants are perhaps better placed to address than smaller hosting providers. The three biggest cloud providers evidently treat the issue with extreme seriousness, but despite its video and presentation, Microsoft has been the least forthcoming on the subject. ®