Top 500 supers – The Dawning of the GPUs
Coiled for the 10 petaflops spring
With the International Super Computing conference underway this week, the Top 500 ranking of the world's most powerful supercomputers is out, and the twice-yearly list is just starting to be transformed by the advent of cheap flops embodied in graphics co-processing engines from Nvidia and Advanced Micro Devices.
While the 1.76 petaflops "Jaguar" Opteron cluster built by Cray for the Oak Ridge National Laboratory held onto its top spot on the list without any changes since last November, the "Nebulae" cluster made by Dawning for the National Supercomputing Center in Shenzhen in China is nipping at Jaguar's tail with a blade machine that marries Intel's Xeons to Nvidia's GPU co-processors.
As you will see from perusing the Top 500 list, Jaguar is an XT5 massively parallel cluster that currently has six-core Opteron 8400 processors and uses Cray's SeaStar2+ interconnect in a 3D torus. It has 224,162 cores to deliver a peak theoretical performance of 2.33 petaflops and delivers 1.76 petaflops of sustained performance on the Linpack Fortran matrix math benchmark. Jaguar could be upgraded with the twelve-core XT6 Opteron blades and the new "Gemini" interconnect, which Cray debuted last week in the XE6 super (formerly code-named "Baker"), and that could easily double its performance.
Thus far, Oak Ridge has not divulged its plans, but is monkeying around with x64 clusters and Nvidia's next-generation "Fermi" GPUs. It would be interesting to see what a next-generation "Cascades" super from Cray, using the "Aries" interconnect (a kicker to the just-announced Gemini), Intel Xeon processors (very likely "Sandy Bridge" Xeons with eight or more cores each), and Nvidia GPUs might do in terms of sustained performance. We'll have to wait a few years to see that, and it may be at Oak Ridge and it may not.
But for the moment, China's NSCS is enthusiastically adopting Dawning's TC3600 blade servers, equipped with Intel's six-core X5650 processors and Nvidia's C2050 GPUs. The exact configuration of the Nebulae machine at NSCS was not available at press time, but the TC3600 blade server is a 10U chassis that holds ten two-socket blades. The C2050s are PCI-Express GPU co-processors with 448 cores and 3 GB of their own GDDR5 memory, rated at 515 gigaflops doing double-precision floating point math and 1.03 teraflops doing single-precision. The Top 500 ranking for Nebulae does not provide a blade or GPU count, but the word on the street is that it has 4,700 nodes. What the Top 500 does say is that the machine has 120,640 cores in total for a peak theoretical performance of 2.98 petaflops and 1.27 petaflops sustained running the Linpack test. All of the nodes in the Dawning blade cluster are linked by quad data rate (40 Gb/sec) InfiniBand switches.
The first thing to notice about the Jaguar and Nebulae supers is the difference between peak and sustained performance. For the Cray Jaguar Opteron cluster, 75.5 per cent of the flops contained in the box end up doing real Linpack work, while on the Dawning Xeon-Tesla hybrid, only 42.6 per cent of the peak performance embodied in the CPUs and GPUs actually pushes Linpack math. So it would seem that the all-x64 machine has the edge, right? Wrong. Jaguar cost around $200m to build and burns around 7 megawatts of juice, while the Nebulae machine probably costs on the order of $50m (that's an El Reg estimate) and burns only 2.55 megawatts of juice.
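The efficiency percentages quoted here are simply sustained Linpack performance divided by peak theoretical performance. A minimal Python sketch of that arithmetic, using only the petaflops figures given in this article (the dictionary layout and the function name are illustrative, not anything published by the Top 500):

```python
# Peak theoretical and sustained Linpack performance in petaflops,
# as quoted in the article for each machine.
SYSTEMS = {
    "Jaguar": {"peak_pf": 2.33, "sustained_pf": 1.76},
    "Nebulae": {"peak_pf": 2.98, "sustained_pf": 1.27},
}


def linpack_efficiency(peak_pf, sustained_pf):
    """Return sustained performance as a percentage of peak."""
    return 100.0 * sustained_pf / peak_pf


# Hypothetical helper structure: machine name -> efficiency in per cent.
efficiencies = {
    name: linpack_efficiency(**figures) for name, figures in SYSTEMS.items()
}

for name, pct in efficiencies.items():
    print(f"{name}: {pct:.1f} per cent of peak")
```

Run as written, this gives 75.5 per cent for Jaguar and 42.6 per cent for Nebulae, matching the figures in the text.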
When you do the math, as far as Linpack is concerned, Jaguar takes just under 4 watts to deliver a gigaflops at a cost of $114 per gigaflops for the iron, while Nebulae consumes 2 watts per gigaflops at a cost of $39 per gigaflops for the system. And there is little doubt that the CUDA parallel computing environment is only going to get better over time, and hence more of the theoretical performance of the GPU will end up doing real work. (Nvidia is not there yet. There is still too much overhead on the CPUs as they get hammered fielding memory requests for GPUs on some workloads.)
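For anyone who wants to check the sums, here is a rough Python sketch that divides each machine's power draw and price by its sustained Linpack rating. With 1 petaflops equal to 1,000,000 gigaflops, the quoted inputs work out to roughly 4 and 2 watts, and $114 and $39, per gigaflops. The $50m Nebulae price is the El Reg estimate from above, and all names in the code are illustrative assumptions.

```python
GIGAFLOPS_PER_PETAFLOPS = 1_000_000

# Sustained Linpack petaflops, power in megawatts, cost in millions of dollars.
# Nebulae's cost is an estimate, not a published figure.
MACHINES = {
    "Jaguar": {"sustained_pf": 1.76, "power_mw": 7.0, "cost_musd": 200.0},
    "Nebulae": {"sustained_pf": 1.27, "power_mw": 2.55, "cost_musd": 50.0},
}


def per_gigaflops(sustained_pf, power_mw, cost_musd):
    """Return (watts per gigaflops, dollars per gigaflops) on Linpack."""
    gigaflops = sustained_pf * GIGAFLOPS_PER_PETAFLOPS
    watts = power_mw * 1_000_000 / gigaflops
    dollars = cost_musd * 1_000_000 / gigaflops
    return watts, dollars


for name, figures in MACHINES.items():
    watts, dollars = per_gigaflops(**figures)
    print(f"{name}: {watts:.2f} W and ${dollars:.0f} per gigaflops")
```

On these inputs Nebulae needs about half the power and roughly a third of the money for each sustained gigaflops, despite its much lower Linpack efficiency.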
The power efficiency from using math co-processors is, of course, why Los Alamos National Laboratory had IBM build the "Roadrunner" hybrid Opteron-Cell massively parallel super, which marries blades using dual-core Opterons with blades using IBM's PowerXCell 8i co-processors to create what is now a one petaflops sustained super. (A year and a half ago, Roadrunner had a few more nodes, was rated at 1.1 petaflops, and was the fastest super in the world, but that was before most of the machine was taken out of public view and began its classified nuclear simulations.)
Number four on the Top 500 is another Cray Opteron cluster, called "Kraken", that is sitting at Oak Ridge; the machine is owned by the University of Tennessee but operated by the US Department of Energy. Kraken is an XT5 parallel box with 98,928 cores, built from AMD's six-core Opteron 8400s; it weighs in at just over 1 petaflops of peak performance and is rated at 831.7 teraflops on the Linpack test. Number five on the list is the "Jugene" BlueGene/P cluster built by IBM for the Forschungszentrum Juelich in Germany, which has 294,912 PowerPC cores, is rated at 825.5 teraflops sustained, and has a peak theoretical performance of just over a petaflops, too.
Two other petaflops-class machines are on the list. At number seven is the "Tianhe-1" supercomputer, built by the Chinese government, that entered the Top 500 last fall. The box has not changed at all in the past six months. It is made up of Xeon server nodes using a mix of E5540 and E5450 processors, with each node configured with two of AMD's Radeon HD 4870 graphics cards to be used as co-processors. The machine has 71,680 cores, and it's rated at 563.1 sustained teraflops and 1.2 petaflops of peak theoretical performance. Again, there's that wide gap between peak and sustained performance with CPU-GPU combos, a gap that has to close. Number six on the list this time around is the "Pleiades" Altix ICE cluster at NASA Ames, which has lower peak performance at 973.3 teraflops, but bests the Tianhe-1 (which means "River in the Sky" in Chinese) on the Linpack test, with 772.7 sustained teraflops of performance.
Rounding out the top ten at number eight is IBM's BlueGene/L super at Lawrence Livermore National Laboratory, which ruled the roost for a number of years, with its 478.2 teraflops of sustained performance, followed by the "Intrepid" BlueGene/P box at Argonne National Laboratory, rated at 458.6 teraflops. Number ten, the "Red Sky" 433.5 teraflops blade super at Sandia National Laboratory, was made by Oracle (well, really Sun Microsystems back when it cared about supercomputing). The "Ranger" Sun blade super at the University of Texas, rated at nearly the same speed on Linpack (but with 62,976 cores and a slower interconnect), was pushed down to number eleven.