Hate to interrupt you, but what's this got to do with QEMU-KVM?
The one thing you just should never do is blame your compiler or kernel or microprocessor when your code bombs. Your carefully crafted source, like a teenager's first poem to their first crush, is an extension of your essence, your passion to do things right. When it goes wrong, though, 99.999 per cent of the time it's because you suck, and any time spent blaming the toolchain or CPU is time not spent fixing your own work.
Well, here's one of those rare moments where you can blame someone else.
In the background to all of this, Google security engineer Robert Święcki had privately disclosed to AMD engineers and the Linux kernel security team a strange kernel "oops" that, in the words of Linux kernel chief Linus Torvalds, "turned out to be a AMD microcode problem with NMI delivery."
Święcki had reported a similar exception to Slaby's GDB crash: the kernel had tried to execute code in memory that was off limits. That sort of fault will make the hairs on the back of a security engineer's neck stand up: if a hacker can control or even simply influence where the CPU ricochets off to in kernel mode, she can potentially hijack the whole computer. It's the sort of bug that you have to get to the bottom of.
"I'm actually starting to suspect that it's an AMD microcode bug that we know very little about," Torvalds said, referring to Slaby's GDB prang. "There's apparently register corruption (the guess being from NMI handling, but virtualization was also involved) under some circumstances.
"We do have a reported 'oops' on the security list that looks totally different in the big picture, but shares the exact same 'corrupted stack pointer register state resulting in crazy instruction pointer, resulting in NX fault' behavior in the end."
In other words, Slaby had stumbled across an AMD microcode issue on production hardware, an issue that Święcki and the Linux kernel security team were already investigating.
NMIs are interrupts that absolutely must be handled by the kernel and cannot be ignored: you can't tell the chipset to postpone them because they are typically generated by a hardware failure or a watchdog timer raising an alarm. Like almost all interrupts, they can potentially fire at any time. Perhaps an NMI delivery problem occurred during the doomed GDB test: a microcode bug meddled with the stack pointer in an innocuous kernel function during a process scheduling operation, and that spiraled into a serious exception in the host kernel.
The other ingredient in this saga is virtualization: the OpenSUSE build server was compiling GDB and testing it in a QEMU-KVM virtual machine. That means an unprivileged user in a guest virtual machine merely building software was able to trigger an "oops" in the host server's kernel. That's not good.
According to Święcki, the microcode glitch mostly interferes with the host kernel's stack pointer RSP, but it can also corrupt the contents of other registers – all of which can cause crashes, unpredictable behavior, or potentially be exploited to gain control of the system. The Googler said he can, in "rare" conditions, commandeer the host machine's kernel from a virtual machine guest.
"The visible effects are, in about 80 per cent of cases, incorrect RSP [values] leading to bad returns into kernel data or [triggering] stack-protector faults," Święcki told the Linux kernel mailing list.
"But there are also more elusive effects, like registers being cleared before use in indirect memory fetches.
"I can trigger it from within QEMU guests, as non-root, causing bad RIP [instruction pointer register values] in the host kernel. When testing, a couple of times out of maybe 30 'oopses', I was able to set it to user-space addresses mapped in the guest. It greatly depends on timing, but I think with some more effort and populating the kernel stack with guest addresses it'd be possible to create a more reliable QEMU guest to host ring-0 escape."
Święcki added: "My proof-of-concept code [to trigger the bug] works only under QEMU-KVM. Xen and KVMtools don't appear to be affected by it because there's some missing functionality in them that my PoC makes use of. But another thread started on [the Linux kernel mailing list] made me think those hypervisors can also be affected, although that's just speculation."
AMD told The Register the bad microcode – 0x6000832 and 0x6000836 – affects the Opteron 6200 and 6300 series, although Święcki believes the problem extends to newer AMD FX and Opteron 3300 and 4300 chips using the Piledriver architecture and the buggy microcode.
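For admins wondering whether a Linux box is running one of those two revisions, here's a rough sketch of a check. It assumes an x86 Linux system where /proc/cpuinfo carries a "microcode" field with a hexadecimal revision; the function names are made up for illustration. A match only tells you the revision is on the list above, and a miss doesn't prove the machine is in the clear.

```python
import re

# The two microcode revisions named above as buggy.
AFFECTED_REVISIONS = {0x6000832, 0x6000836}


def microcode_revisions(cpuinfo_text):
    """Collect every microcode revision found in /proc/cpuinfo-style text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        # Assumed line shape: "microcode<TAB>: 0x6000832"
        match = re.match(r"microcode\s*:\s*(0x[0-9a-fA-F]+)", line)
        if match:
            revisions.add(int(match.group(1), 16))
    return revisions


def has_affected_microcode(cpuinfo_text):
    """True if any CPU entry reports one of the revisions listed above."""
    return bool(microcode_revisions(cpuinfo_text) & AFFECTED_REVISIONS)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        text = ""  # not Linux, or the file is unreadable
    found = sorted(hex(rev) for rev in microcode_revisions(text))
    print("microcode revisions:", ", ".join(found) or "none reported")
    print("on the affected list:", has_affected_microcode(text))
```

The parsing is kept separate from the file read so the check can be run against cpuinfo text captured from another machine.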
Specific details on how to trigger the bug have not been disclosed ahead of the updated microcode's release. Not that you need to know exactly how to exploit the vulnerability: you could be unlucky like the SUSE team and encounter it randomly on a live system.
Finally, Święcki pointed to a similar bug VMware has worked around in its ESXi hypervisor software for AMD Opteron 6300 CPUs. "Under a highly specific and detailed set of internal timing conditions, the AMD Opteron Series 63xx processor may read an internal branch status register while the register is being updated, resulting in an incorrect RIP. The incorrect RIP causes unpredictable program or system behavior, usually observed as a page fault," reads the VMware note, issued last year.
It's no secret that microprocessors – especially today's complex CPUs with billions of transistors – have bugs. Intel and AMD both publish hundreds of pages of notes warning of subtle flaws in their designs. Most of the cockups are harmless to normal users, some are not; operating systems can work around the engineering blunders, or not bother at all for bugs that are benign. Sometimes, though, new microcode is needed. AMD last issued new microcode for its x86 processors in December 2014. ®
PS: Here's a video of David Kaplan, a hardware security architect at AMD, explaining how you'd typically go about testing and debugging a modern x86 CPU.