Spectre/Meltdown fixes in HPC: Want the bad news or the bad news? It's slower, say boffins
MIT Lincoln metalheads broke big iron so you don't have to… oh, you still have to, don't you?
HPC admin? Feeling slighted that all the good Spectre/Meltdown mitigation benchmarks ignore big iron? Fear not, a bunch of MIT boffins are on your side.
Unfortunately, what they found is that network connections, disk accesses, and computational workloads can all be affected by the fixes, whether in the operating system or the microcode.
Within a week of the twin bugs being published, performance was on everyone's mind – because speculative execution is a long-standing performance feature in microprocessors.
Amazon almost immediately warned of performance hits, echoed quickly by others in the industry. Intel responded that performance impacts would depend on workload. SolarWinds conducted its own tests on AWS, and Netflix thought the damage could be contained.
None of those tests were wholly relevant to HPC workloads, however, so an 18-member team from MIT's Lincoln Laboratory set to work on their own.
Hammering the iron
The conclusion of their paper probably won't surprise HPC admins who have diligently implemented microcode and operating system patches and seen their machines take a doze. "The impact can be significant on both synthetic and realistic workloads," the team writes, adding that "the performance penalties are difficult to avoid even in dedicated systems where security is a lesser concern."
They conducted their experiments on a real HPC environment: MIT Lincoln's "Green-500" listed TX-Green Supercomputer. The tests were run on Intel Xeon E5-2683 v3 Haswell servers with 256 GB of system RAM, and a 10PB Lustre storage system using Seagate ClusterStor CS9000s.
They used six compilations of the Linux kernel (using the GridOS 26 Red Hat-derived distribution), with GRSecurity kernel enhancements enabled and disabled, bringing the total number of configurations to 10. The specific mitigations tested were PAX_MEMORY_UDEREF where GRSecurity was enabled, the RETPOLINE configuration option, and the "0x3c" BIOS microcode update.
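For admins who want to see which mitigations their own nodes are running, here is a minimal Python sketch. It is not taken from the paper. It assumes a Linux kernel that reports mitigation status under /sys/devices/system/cpu/vulnerabilities, which older or unpatched kernels may not provide, and the function name read_mitigation_status is purely illustrative.

```python
from pathlib import Path

# Assumption: the running kernel exposes per-vulnerability status files here.
# Kernels without this sysfs directory simply yield an empty report.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigation_status(vuln_dir=VULN_DIR):
    """Return {vulnerability_name: status_text}, or {} if the directory is absent."""
    vuln_dir = Path(vuln_dir)
    if not vuln_dir.is_dir():
        return {}
    status = {}
    for entry in sorted(vuln_dir.iterdir()):
        if entry.is_file():
            try:
                # Each file holds one line, e.g. a "Mitigation: ..." description.
                status[entry.name] = entry.read_text().strip()
            except OSError:
                status[entry.name] = "unreadable"
    return status


if __name__ == "__main__":
    report = read_mitigation_status()
    if not report:
        print("No vulnerability report found (older kernel or non-Linux system)")
    for name, text in report.items():
        print(f"{name}: {text}")
```

Running it on every node in a cluster and comparing the output is one quick way to spot machines whose kernel or microcode differs from the rest before benchmarking them.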
Get a coffee and sit down, here's a summary of the key benchmark results:
- Network connection establishment – "With all mitigations enabled, the mainline kernel is slowed down by approximately 15 per cent without and 21 per cent with the User-Based Firewall. The GRSecurity-enabled kernel is also slowed down by 15 per cent without, but 67 per cent with the User-Based Firewall."
- Disk access – "With all mitigations enabled, the mainline kernel is slowed down by approximately 50 per cent on local disk and 33 per cent on Lustre. The GRSecurity-enabled kernel is slowed down by 90 per cent on local disk and 33 per cent on Lustre."
- Computationally intensive code – Kernel changes had little effect, "as they make few requests for kernel services". That's the good news. The bad news? "A noticeable slowdown was seen with the microcode updated... These slowdowns were measured at 21 per cent for pMatlab and 16 per cent for TensorFlow for the baseline kernel with all mitigations and 19 per cent and 15 per cent respectively with just the microcode."
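To put those percentages in job-queue terms, here is a rough Python sketch. The quoted summary does not say whether "slowed down by X per cent" means run time grew by X per cent or throughput fell by X per cent, so the sketch computes both readings. The 10-hour baseline is a made-up figure for illustration, and the function names are hypothetical.

```python
def runtime_if_time_grows(baseline_hours, slowdown_pct):
    """Reading 1 (assumption): run time increases by slowdown_pct per cent."""
    return baseline_hours * (1 + slowdown_pct / 100)


def runtime_if_throughput_drops(baseline_hours, slowdown_pct):
    """Reading 2 (assumption): throughput falls by slowdown_pct per cent."""
    if not 0 <= slowdown_pct < 100:
        raise ValueError("slowdown_pct must be in [0, 100) for this reading")
    return baseline_hours / (1 - slowdown_pct / 100)


if __name__ == "__main__":
    baseline = 10.0  # hypothetical job length in hours, not from the paper
    # Slowdown figures quoted above for the mainline kernel with all mitigations.
    quoted = {
        "pMatlab": 21,
        "TensorFlow": 16,
        "local disk": 50,
        "Lustre": 33,
    }
    for workload, pct in quoted.items():
        t1 = runtime_if_time_grows(baseline, pct)
        t2 = runtime_if_throughput_drops(baseline, pct)
        print(f"{workload}: {t1:.2f} h (time grows) or {t2:.2f} h (throughput drops)")
```

The two readings diverge as the percentage climbs, so it is worth checking the paper's definition before budgeting cluster hours from the headline numbers.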
Code optimisation can help, the researchers said, but that's not always possible in academic HPC environments, since a lot of workloads are either one-off calculations, or they're supporting immature projects.
The team involved included MIT Lincoln Laboratory's Andrew Prout, William Arcand, David Bestor, Bill Bergeron, Chansup Byun, Vijay Gadepally, Michael Houle, Matthew Hubbell, Michael Jones, Anna Klein, Peter Michaleas, Lauren Milechin, Julie Mullen, Antonio Rosa, Siddharth Samsi, Charles Yee, Albert Reuther, and Jeremy Kepner. ®