Intel stretches HPC dev tools across chubby clusters
Cluster Studio XE ready for MICs, not for GPUs
SC11 Supercomputing hardware and software vendors are getting impatient for the SC11 supercomputing conference in Seattle, which kicks off next week. More than a few have jumped the gun with product announcements this week, including chipmaker Intel.
No, Intel is not going to launch its "Sandy Bridge-EP" Xeon E5 processors, which are expected early next year. But the new Cluster Studio XE toolset for HPC customers will help those lucky few HPC and cloud shops that have been able to get systems this year to squeeze more performance out of their Xeon E5 clusters.
The Cluster Studio XE stack includes a slew of Intel tools for creating, tuning, and monitoring parallel applications running on x86-based parallel clusters. Intel had already been selling a set of application tools called Cluster Studio, which bundled up the chip giant's C, C++, and Fortran compilers, its rendition of the message passing interface (MPI) messaging protocol that allows server nodes to share work, and various math and multithreading libraries to goose the performance of applications.
With the XE (Extended Edition) of the HPC cluster tools, Intel is goosing the performance of the MPI library, and claims its MPI 4.0.3 stack is anywhere from 3.3 to 6.5 times as fast as the OpenMPI 1.5.4 and MVAPICH2 1.6 MPI stacks from the open source community. Benchmark tests were done on a 64-node system running 768 processes and linked by InfiniBand switches.
Intel tested the Platform Computing MPI 8.1.1 stack against the three MPI stacks listed above, only this time on an eight-node system; in this case the performance differences between Intel and Platform (which is now owned by IBM) were not huge. With the Microsoft MPI 3.2 stack on the same iron, the Intel MPI stack running on Windows servers was anywhere from 2.17 to 2.74 times faster than the Microsoft MPI.
The updated Intel MPI stack can scale to over 90,000 MPI cores, and also has hooks into the open source SLURM job scheduler that was created by Lawrence Livermore National Laboratory because of its frustration with closed-source job schedulers and the state of the open source ones.
With the Cluster Studio XE roll-up, the Inspector and Debugger modules now have cluster-level data gathering and reporting, instead of just seeing things at a node level. What this means, in plain American, is that these add-ons to the compilers can look for memory leaks and threading errors across a cluster of machines without sending the HPC application programmer on a wild goose chase to locate performance issues or crashes on an individual node. (With 90,000 cores, which is 5,625 two-socket nodes using the future eight-core Xeon E5 processors, you can't look for these issues manually.)
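For anyone checking the sums, the 5,625-node figure only works if each node carries two eight-core sockets, or 16 cores per box. That socket count is our assumption, not something Intel stated, and the function name below is purely illustrative. A quick Python sketch of the arithmetic:

```python
def nodes_needed(total_cores, cores_per_socket=8, sockets_per_node=2):
    """Return how many nodes are needed to supply total_cores.

    sockets_per_node=2 is an assumption made to match the article's
    figure; the source only says "eight-core Xeon E5 processors".
    """
    cores_per_node = cores_per_socket * sockets_per_node
    # Ceiling division, so a partly filled node still counts as a node.
    return -(-total_cores // cores_per_node)


print(nodes_needed(90_000))  # 5625
```

With single-socket nodes the same 90,000 cores would need 11,250 boxes, which is why the two-socket reading looks like the intended one.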
The Trace Analyzer and Collector module can now look at MPI performance across the nodes in a cluster and evaluate how well MPI is load balancing across the nodes. The VTune Amplifier, which is a tool that Intel uses to visualize the threading behavior in a single node, can now show threading issues across the cluster.
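To give a feel for what "how well MPI is load balancing" can mean in numbers, here is a rough sketch of one common imbalance measure: the slowest rank's time divided by the average, minus one. This is not how Trace Analyzer and Collector computes anything, as far as we know; the function name and the per-rank timings are made up for illustration.

```python
from statistics import mean


def load_imbalance(rank_times):
    """Return max/mean - 1 for a list of per-rank busy times.

    0.0 means perfectly balanced; 0.5 means the slowest rank took
    50 per cent longer than the average rank. Illustrative metric only.
    """
    if not rank_times:
        raise ValueError("need at least one per-rank time")
    return max(rank_times) / mean(rank_times) - 1.0


# Hypothetical timings in seconds for four MPI ranks.
balanced = [10.0, 10.0, 10.0, 10.0]
skewed = [10.0, 10.0, 10.0, 18.0]

print(load_imbalance(balanced))  # 0.0
print(load_imbalance(skewed))    # 0.5
```

In the skewed case three ranks sit idle while the fourth finishes, and that kind of waste is exactly what gets hard to spot by hand once there are thousands of nodes.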
The Cluster Studio XE bundle includes the Intel v12.1 compilers that were launched in September, which offered between 22 and 27 per cent better performance on Fortran benchmarks and from 6 to 11 per cent on C/C++ integer performance compared to the v12.0 releases running on Linux and Windows machines. C/C++ floating-point performance improvements were a few points. Intel claims it has a considerable performance advantage over other compilers – anywhere from 21 to 47 per cent faster code execution on C, C++, and Fortran tests. And that performance is not just tied to Intel's own Xeon processors.
Perhaps more significantly, on Fortran, Intel now believes it has the performance edge over Portland Group 11.4 and Absoft 11.1 on either Windows or Linux machines. The performance jump is particularly acute on Windows machines running C++.
"We believe that we have the best performance, regardless of the type of x86 chip," James Reinders, evangelist for Intel's software division, tells El Reg.
The v12.1 compilers are tuned up for the forthcoming Xeon E5 processors, and even though Intel has not been able to get its hands on machines using AMD's impending "Interlagos" Opteron 6200 processors to tune and test them, Reinders says that he is confident that the compilers and the Cluster Studio XE tools will wring more flops out of these AMD chips than the alternatives.
The interesting twist in all this is that the Cluster Studio compilers and tuning and visualization tools cannot peer into GPU coprocessors, and Reinders says he is not even sure how Intel would go about doing that. But because the future "Knights" x86-based coprocessors are based on the same architecture as Intel and AMD chips, Cluster Studio XE tools will be able to see into these MIC coprocessors and help coders tweak and tune their apps for them.
The normal Cluster Studio stack, which includes the Intel compilers as well as the math and clustering libraries, costs $1,849 per developer on a Linux workstation and $1,499 per developer on a Windows workstation. There is no runtime or royalty charge for having the tools run on a parallel x86 cluster. If you want to go all the way to the Cluster Studio XE stack, then you pay $2,849 per developer on Linux and $2,499 on Windows. Yes, the Windows versions are cheaper. ®