Cray mimics Ethernet atop SeaStar interconnect
Linux shortcut cooks with SLES
Preview of things to come
Bolding warns that at the moment this is a "feature release," which means the Cluster Compatibility Mode is really a technology preview. He adds that the clone TCP/IP stack riding atop the SeaStar interconnect "can provide reasonable performance for a relatively small number of nodes," but cautions that on very large XT implementations, customers are going to want to fall back on what Cray is now calling Extreme Scalability Mode - recompiling the Linux applications to have their nodes talk directly through the SeaStar interconnect. CCM can scale to 2,048 cores on the TCP/IP stack now, which means somewhere between 85 and 170 nodes, depending on the Opteron processors customers choose.
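The 85-to-170 node range can be checked with some back-of-the-envelope arithmetic. The sketch below assumes two-socket compute nodes and per-socket core counts of twelve and six. Those are illustrative assumptions, not figures Bolding supplied, and the function name is made up for the example.

```python
# Rough check of the CCM node range quoted above.
# Assumptions (not from Cray): two sockets per node, and either
# twelve-core or six-core Opterons in each socket.
CCM_CORE_LIMIT = 2048  # core ceiling for CCM on the TCP/IP stack, per the article


def max_full_nodes(core_limit, cores_per_socket, sockets_per_node=2):
    """Return how many fully populated nodes fit under the core limit."""
    cores_per_node = cores_per_socket * sockets_per_node
    return core_limit // cores_per_node


if __name__ == "__main__":
    for cores_per_socket in (12, 6):
        nodes = max_full_nodes(CCM_CORE_LIMIT, cores_per_socket)
        print(f"{cores_per_socket}-core sockets: {nodes} nodes")
```

Under those assumptions, twelve-core parts give the low end of the range and six-core parts the high end, which lines up with the numbers quoted.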
Next year, Cluster Compatibility Mode will get a whole lot more interesting, when Cray supports the OpenFabrics Enterprise Distribution (OFED) drivers for InfiniBand much as it is doing for TCP/IP drivers today. One of the key features of InfiniBand that Ethernet still does not have (but soon will) is called Remote Direct Memory Access, which allows server nodes to talk to each other directly, using InfiniBand controllers to link memory controllers, bypassing the network stack entirely and offering much lower latency than even 10 Gigabit Ethernet. In essence, with support of the OFED drivers, Cray's Cluster Compatibility Mode will allow the SeaStar interconnect to emulate InfiniBand and yield much better performance than the emulated TCP/IP stack being offered initially with CLE 3.0.
"Once we have the OFED drivers, we think we can come very close to our native communications speed," says Bolding. No word yet on how far the emulated InfiniBand will scale in terms of processor nodes, but it has to be pretty far to bother to go to the trouble.
Cray has been working on Cluster Compatibility Mode for the past two years, and Bolding admits that this clever network emulation would have been useful for Cray to expand its addressable market. But at the time, Cray was more concerned with breaking the petaflops barrier at the big supercomputing centers like Oak Ridge National Laboratory that are paying the current bills.
Cray has high hopes for Cluster Compatibility Mode. "We think this will take away the fear of getting a Cray system," explains Bolding. "We have removed cost as a concern over the past few years, and when we did, some customers feared that they would end up getting something that was not compatible with other Linux machines."
Well, of course, the customers were right in this regard. But if the OFED drivers running atop SeaStar and emulating InfiniBand work as well as Bolding says they can, this would indeed be another barrier down. Provided the SeaStar interconnect has enough oomph that emulated InfiniBand performs as well as or better than the real thing, of course. By the way, the emulated Ethernet and InfiniBand drivers will support multiple MPI stacks, so you are not locked in.
CLE 3.0 will initially only ship on the new XT6 and XT6m parallel supers, which use blade servers based on the brand-new twelve-core "Magny-Cours" Opteron 6100 processors from Advanced Micro Devices. The XT6 nodes were previewed last fall at the SC09 supercomputing trade show; Cray has not said that the XT6 nodes are actually shipping yet in volume.
Later this year, Cray will support CLE 3.0 on the XT5 supers, which are based on an earlier six-core Opteron generation but use the same SeaStar2+ generation of interconnect that was held over for the XT6 nodes. In early 2011, Cray will support CLE 3.0 on XT4 generations of supers, but has no plans to support it on XT3 machines. It is a matter of testing and qualification, which Cray is not going to spend money on with so few of these XT3 machines still in the field. If you want to run an emulated Ethernet-MPI stack on top of an XT3 machine, you have to move up to an XT4 or higher.
Presumably the combination of the upcoming "Gemini" interconnect and the XT6 nodes, which comprise the "Baker" family of Opteron supers slated for later this year, will have some sort of hardware assistance to speed up the emulated Ethernet or InfiniBand that the Cluster Compatibility Mode offers inside CLE 3.0. Bolding did not say.
CLE 3.0 has a number of other enhancements. First, it includes Oracle's open source Lustre 1.8 clustered file system, and also supports IBM's General Parallel File System (GPFS) and Panasas clustered file systems. GPFS and Panasas are new; the Cray XTs have been running Lustre since their inception. CLE 3.0 is also designed to scale across 500,000 cores in a parallel cluster, up from a 200,000-core ceiling with CLE 2.0. CLE 3.0 also includes a diagnostic tool called NodeKARE, short for Node Knowledge and Reconfiguration, which makes sure jobs are scheduled to run only on nodes that are behaving themselves and not acting all wobbly.
What Cray has not said is whether or not it will be offering a Cluster Compatibility Mode in conjunction with Microsoft for its XT line of supers. This would clearly be very useful. Although Cray supports Windows HPC Server 2008 on its baby and midrange lineup, this Windows variant is not supported on the massively scalable XT line. But over the long haul, that will have to be a goal for the company, since the point of having an entry and midrange super line is to get the customers and grow them up to full-scale, massively parallel machines as their workloads expand. ®