Managing the Linux kernel at AWS: 'A large team of security experts' dealing with fallout from Spectre, Meltdown flaws
OS director on staying safe and advantages of Amazon's Nitro hypervisor
Interview At the AWS re:Invent conference last week, The Register asked Chris Schlaeger, director of kernel and operating systems, how the cloud giant protects itself and its customers from speculative execution bugs in Intel CPUs.
Schlaeger told us he's responsible "for the lowest layer of the software stack that runs on almost all the servers. We work on things like the Linux kernel, various hypervisors, Xen, KVM, Firecracker if you want to include the VMM [Virtual Machine Manager] as well. And we are heavily involved in the definition of the EC2 [Elastic Compute Cloud] instance types, especially for the accelerated platform."
A couple of months ago, Linux kernel maintainer Greg Kroah-Hartman told us that the infamous Spectre and Meltdown flaws, along with the related MDS (Microarchitectural Data Sampling) bugs, would be "with us for a long time," as "more and more of the same types of problems" are discovered.
His advice was to "disable [Intel's] hyper-threading. That's the only way you can solve some of these issues," especially if you are running in a shared environment – like the public cloud.
So what does AWS do?
"Greg is a good friend of mine and I'm aware of his statements. One of the things to keep in mind, though, is that he's representing a Linux kernel developer community," Schlaeger told us.
"The Linux kernel is used in a very broad range of systems and the kernel developers have no control over what people are doing with their kernel. In that position, Greg's advice to turn off hyper-threading is a good tool to get you out of most of the trouble in a simple and easy fashion. It doesn't fix every single problem but it gets you a whole lot down the road.
But there's one small problem...
"Unfortunately when you do that, you lose about 30 to 40 per cent of the server's performance. Put yourself into my shoes. That is not a super attractive option. We have customers that have our biggest instance types. The in-memory databases, for example, are scaled to max out the box. If I take away 30 to 40 per cent, I kill their application.
"At our size we have many customers in this situation and it's not financially attractive for us to just turn off hyper-threading.
"That's where you need to look at the fine print of these [vulnerabilities]. They come with a lot of detail. Even the detail that Intel provides is often not enough to understand what is going on, and in which particular situation you are or are not affected by. So the past two years I have a large team of security experts that do nothing else but deal with the fallout. They make sure that in our environment, we are still able to keep it safe without turning off hyper-threading."
Schlaeger added: "It is a daily battle we have to fight. In our environment we well know what we are doing, how we use the hypervisor, how the guests are allocated to the physical cores. We have found a way to keep things safe so there are no side-channels for the existing [issues].
"For example, we've changed the scheduler that was in the Linux kernel as well as in the hypervisor to ensure that guests never run at the same time on a core pair. That did cost a bit of performance but it was mostly manageable for us."
AWS has tried to get its patches upstream but, because they are designed for the narrow use case of AWS services, there have been problems getting them accepted.
"The kernel community wasn't too enthusiastic at taking the patches," Schlaeger told us. "We had a bunch of discussions with Peter Zijlstra [kernel maintainer at Intel] on this. He has some alternative approach[es]. The problem we need to solve is slightly different from what the community needs to solve because they need to cover all possible use cases and make it relatively idiot-proof.
"I'm hopeful that we will find an agreement, because we have very little interest in carrying our own elaborate patches, especially in the kernel scheduler. It's just work that we need to do every time we do a new version of the kernel. Getting it upstream would be the ideal solution for us."
The work AWS has done does not remove responsibility from its customers to take precautions, Schlaeger said. "We did the homework to keep our environment safe, but the kernel inside the instance, or the operating system core – because it also affects Windows – do require changes to the kernel.
"This is not just something that we can solve. We make sure our infrastructure is safe but as the customers have a relatively direct access to the CPU they are affected. What they don't need to worry about is the microcode updates. We apply all the microcode updates for them."
Another factor in the security of the AWS platform is the hypervisor in use. "We have two primary hypervisors. That's the Xen hypervisor and the Nitro hypervisor which is a heavily customized version of KVM," said Schlaeger.
In the Nitro system, a card and security chip mean that virtualization functions are offloaded to dedicated hardware and the security model is locked down. Nitro uses a minimal Linux kernel whereas "in the Xen world you have a special instance or domain which is largely a full-blown Linux distribution. It has about 300 RPM packages in there. That is painful to maintain. You isolate it, but should somebody get into this, you have full root access to the system."
There are implications for how the two hypervisors are managed. With Xen, "employees could log in with SSH into a server to do administrative tasks. It's sometimes helpful, but very powerful. I love the old saying, as soon as someone logs in with root permissions to a server, all previous knowledge of everybody else of what this server is like, is destroyed.
"Having so many servers, it's not something we can afford. The Nitro system is designed that there is a very specific API. It's authenticated, we can log it, we know exactly who did what, and what the actions are. I can get some debug information, I can stop an instance, I can start another instance, and that's about it."
Although Nitro seems preferable, Schlaeger said: "We have no intention to switch the systems that are running Xen to Nitro. Nitro isn't just a hypervisor and it's not a one-to-one replacement of Xen."
Are these security efforts reassuring? AWS is clearly paying close attention to the Intel issues (we learned separately that most, but not all, AWS servers run Intel processors), but it regards the extreme recourse of disabling hyper-threading as too expensive to contemplate.
Presuming that AWS has done everything right, it is still down to its customers to consider whether they are running workloads that may be vulnerable to this type of attack and, if so, to take appropriate action. Another plus for serverless, perhaps.
We are not picking on AWS. We have asked Microsoft and Google the same question and will report back soon. ®