DTrace guru and famed shouter at hard disk drives Brendan Gregg has cooked up a "microbenchmark" to assess the Linux kernel page table isolation (KPTI) patch for the Meltdown CPU design flaw, and predicts significant-but-manageable performance degradation.
Gregg explained on Friday that his work considers the following five factors to properly assess the overhead the patches create.
- Syscall rate: there are overheads relative to the syscall rate, although high rates are needed for this to be noticeable. At 50k syscalls/sec per CPU the overhead may be 2 per cent, and climbs as the syscall rate increases. At my employer (Netflix), high rates are unusual in cloud, with some exceptions (databases).
- Context switches: these add overheads similar to the syscall rate, and I think the context switch rate can simply be added to the syscall rate for the following estimations.
- Page fault rate: adds a little more overhead as well, for high rates.
- Working set size (hot data): more than 10 Mbytes will cost additional overhead due to TLB flushing. This can turn a 1 per cent overhead (syscall cycles alone) into a 7 per cent overhead. This overhead can be reduced by A) pcid, available in Linux 4.14, and B) Huge pages.
- Cache access pattern: the overheads are exacerbated by certain access patterns that switch from caching well to caching a little less well. Worst case, this can add an additional 10 per cent overhead, taking (say) the 7 per cent overhead to 17 per cent.
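To make the first two factors concrete, here is a minimal Python sketch, not Gregg's own tooling, that adds a context switch rate to a syscall rate and expresses the sum per CPU, so it can be compared with the 50k-per-second figure quoted above. The function names and the `REFERENCE_RATE_PER_CPU` constant are our own inventions for illustration. The cumulative context switch counter comes from the `ctxt` line of `/proc/stat` on Linux; the syscall rate has to be measured separately and passed in.

```python
import time

# Rate at which the quoted estimate puts the overhead at roughly 2 per cent.
# The name is hypothetical; the figure is the one cited in the list above.
REFERENCE_RATE_PER_CPU = 50_000


def parse_context_switches(proc_stat_text):
    """Return the cumulative context switch count from /proc/stat content."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no 'ctxt' line found")


def sample_context_switch_rate(interval=1.0, path="/proc/stat"):
    """Sample the system-wide context switch rate (per second) on Linux."""
    with open(path) as f:
        before = parse_context_switches(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_context_switches(f.read())
    return (after - before) / interval


def combined_rate_per_cpu(syscalls_per_sec, ctxsw_per_sec, n_cpus):
    """Add the two system-wide rates, as suggested above, and divide by CPU count."""
    if n_cpus <= 0:
        raise ValueError("n_cpus must be positive")
    return (syscalls_per_sec + ctxsw_per_sec) / n_cpus
```

On a Linux box, `combined_rate_per_cpu(measured_syscall_rate, sample_context_switch_rate(), os.cpu_count())` gives a number to hold up against `REFERENCE_RATE_PER_CPU`. It says nothing about the working set or cache effects, which the post treats separately.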
The post is marvellously detailed and deserves your time, so we shan't summarise all of its points.
Suffice to say Gregg's concluded that the patches will increase overheads, by up to 800 per cent under some circumstances, but that detailed tuning of systems should smooth things out to less-alarming levels.
The variables to watch are syscall rates, whether your Linux kernel supports process-context identifiers (PCID; look for kernel 4.14 or later), and whether you use huge pages, which leave fewer pages to track. With the right tweaks, Gregg was able to substantially reduce the overheads the new code brought to Linux.
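For readers who want to check those variables on their own machines, the following Python sketch shows one way to do it. It is our illustration rather than anything from Gregg's post, and the function names are hypothetical. It assumes the usual Linux layouts: a `flags` line in `/proc/cpuinfo` that lists `pcid` when the CPU supports it, and a `HugePages_Total` line in `/proc/meminfo` for explicitly reserved huge pages (transparent huge pages are reported elsewhere and are not counted here). The version check is a rough heuristic, since distribution kernels may carry backports.

```python
import re


def cpu_has_pcid(cpuinfo_text):
    """True if the first 'flags' line in /proc/cpuinfo content lists 'pcid'."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # Match whole tokens so that 'invpcid' alone does not count.
            return "pcid" in value.split()
    return False


def kernel_at_least(release, major, minor):
    """Compare a kernel release string such as '4.14.0-generic' to major.minor."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return (int(m.group(1)), int(m.group(2))) >= (major, minor)


def huge_pages_total(meminfo_text):
    """Return the HugePages_Total value from /proc/meminfo content, or 0."""
    for line in meminfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "HugePages_Total":
            return int(value.split()[0])
    return 0
```

Feed the first and third functions the contents of `/proc/cpuinfo` and `/proc/meminfo`, and pass `platform.release()` to `kernel_at_least(..., 4, 14)`.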
He's also considered the impact on the AWS infrastructure used by his employer, Netflix, and concluded "between 0.1 per cent and 6 per cent overhead with KPTI due to our syscall rates, and I'm expecting we'll take that down to less than 2 per cent with tuning".
That's still a decent hit, and Gregg also noted that he has not yet been able to measure the effect of hypervisor changes or of the new microcode, either of which may bring further performance penalties. ®