High-performance computing geeks are sweating on a Red Hat fix, after a previous patch broke the Lustre file system.
Red Hat included its own fixes in an August 14 suite of security patches, and soon after, HPC sysadmins found themselves in trouble.
The original report, from Stanford Research Computing Center, details a failure in LustreNet – a Lustre implementation over InfiniBand that uses RDMA for high-speed file and metadata transfer.
The symptoms in the bug report describe a catastrophic networking failure: a system that can't even ping itself, let alone mount filesystems or communicate with other nodes.
Red Hat's advice to the forum is to revert to version 3.10.0-862.11.5.el7 until further notice. The Register has asked how long it expects the fix to take, and how many users might be affected.
It appears the borkage applies to RDMA in general, rather than being specific to Lustre.
Red Hat RDMA specialist Don Dutile is quoted in the forum as saying the issue was still under embargo four days ago: “Already reported and being actively fixed. Cannot make this public, as the patch that caused it was due to embargo'd security fix.
“This issue has highest priority for resolution. Revert to 3.10.0-862.11.5.el7 in the mean time”, Dutile's message continued.
He also noted that the issue was duplicated in bug 1616346.
Red Hat has confirmed the bug and is working on a fix, principal program manager and Red Hat Product Security Assurance team lead Christopher Robinson told The Register in an e-mail.
“The problem will be fixed in kernel-3.10.0-862.13.1 which is currently being reviewed by Red Hat Enterprise Linux Engineering.”
Customers impacted by this bug can revert their kernels to the previous working versions until the fix is publicly available, but if things are more urgent, Robinson said, “customers can request a hotfix kernel by contacting Red Hat Global Support Services.”
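For admins who want to know which nodes to look at first, the two version strings above are enough for a rough triage. What follows is a minimal sketch, not Red Hat tooling: it assumes that any kernel build sorting after the suggested rollback target and before the promised fix is worth checking, which is an inference rather than anything Red Hat has stated. The function names are made up for illustration, and the comparison is a simplified numeric one rather than RPM's full version-ordering rules.

```python
import platform
import re

# Both version strings come from the statements quoted in the article.
KNOWN_GOOD = "3.10.0-862.11.5.el7"  # rollback target suggested by Red Hat
FIXED_IN = "3.10.0-862.13.1"        # release Red Hat says will carry the fix


def release_key(release):
    """Turn '3.10.0-862.11.5.el7.x86_64' into a comparable pair of int tuples."""
    # Keep the leading numeric version and build fields; drop '.el7', arch, etc.
    match = re.match(r"(\d+(?:\.\d+)*)-(\d+(?:\.\d+)*)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    version, build = match.groups()
    return (tuple(int(part) for part in version.split(".")),
            tuple(int(part) for part in build.split(".")))


def possibly_affected(release):
    """True if release sorts strictly between the known-good and fixed kernels.

    Assumption: the affected builds are exactly those in that window.
    """
    return release_key(KNOWN_GOOD) < release_key(release) < release_key(FIXED_IN)


if __name__ == "__main__":
    running = platform.release()  # same string that `uname -r` reports
    try:
        verdict = "worth checking" if possibly_affected(running) else "outside the window"
    except ValueError:
        verdict = "not a release string this sketch understands"
    print(f"{running}: {verdict}")
```

Run on a node, it prints the running kernel release and whether it falls inside that window. A result of "worth checking" is only a prompt to test RDMA on that box, not proof that it is broken.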
The company didn't identify how many HPC sites are affected by the bug, but Robinson said “to date, we have only had a few bugs open tracking this defect”.
An Australian HPC sysadmin told The Register he thought most facilities would have discovered the bug the same way his shop did.
“We brought up a test system to try out the new kernel, and noticed that Lustre wasn't working – the Lnet functionality just wouldn't talk to anything.”
“And we couldn't use the command line tools to test RDMA.”
He noted that it's not always easy to find spare kit in an HPC facility, because there's always another workload you could be running if the “spare” kit was in production (he was able to take a node out of the main system to run the test).
As far as he could tell, it wasn't vendor-specific, because it was clearly somewhere higher up in the kernel. And while the issue only affects people running big iron, it's an important bug: “Without RDMA, you don't get the low latency message-passing between nodes. You don't want to be doing that.” ®