Nvidia previews next-gen Fermi GPUs
The supermodels of HPC: hot, and worth it
SC09 Graphics chip maker and soon-to-be big-time HPC player Nvidia raised the curtain a little higher on its next generation of graphics co-processors at the SC09 supercomputing trade show in Portland, Oregon, this week, and it is arguable that the GPU co-processors aimed at personal supers and massive clusters alike were the star of the show.
The next-generation GPU co-processors were developed under the "Fermi" code-name. Their details were previewed by El Reg last month, and they featured in a future hybrid supercomputer deal at Oak Ridge National Laboratory.
(Oak Ridge is, of course, home to the Jaguar massively parallel Opteron-Linux super built by Cray and currently the top of the Top 500 super charts. While Jaguar does not use GPU co-processors, it very well could before too long; Oak Ridge has been vague about its GPU plans.)
The video card versions of the Fermi chips will be called the GeForce 300 M line, as we reported at the end of October. And at the SC09 event in Portland, Nvidia announced that it is keeping the Tesla brand for its next-generation GPU co-processors for workstations and servers. The Fermi chips will be sold under the Tesla 20 brand, as it turns out.
Flavors - and then some
According to Andy Keane, general manager of Tesla supercomputing at Nvidia, the Tesla 20 cards will come in two flavors and the company will sell co-processor systems that can plug right into HPC clusters and link to servers through PCI-Express 2.0 links - and at around 130 watts. Keane bristles at anyone who claims that a fully burdened heat budget for a server - not just a microprocessor, but its memory controller (if it is not integrated), its chipset, and its memory - will be any lower.
With the Fermi family of GPUs, Nvidia is adding L1 and L2 caches to the co-processors and is putting ECC memory scrubbing on internal GDDR5 video memory on the card as well as accesses to external server memory. This ECC support, as it turns out, is as important as anything else in the chip if you want to sell GPUs to nuke labs.
They can't have memory errors crash an application that may take weeks or months to run and they have to trust the answers they get. (IBM's Cell co-processors, used in the number two "Roadrunner" Opteron-Linux supercomputer installed at Los Alamos National Laboratory, have error correction for their memory. But as far as I can ascertain AMD's Radeon graphics cards and Firestream GPUs do not have ECC.)
The Fermi chip has 512 cores, which is a little more than twice the cores of the first Tesla GPUs. The Fermis bundle 32 cores together into a streaming multiprocessor that has 64 KB of shared L1 cache. All 512 cores have access to a shared 768 KB L2 cache, and they support the IEEE 754-2008 double precision floating point standard.
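The layout figures in that paragraph imply a couple of numbers the article does not spell out. Here is a minimal Python sketch of the arithmetic, assuming that the "first Tesla GPUs" comparison refers to the 240-core C1060 quoted later in the article. The constant and function names are illustrative only, not anything from Nvidia.

```python
# Figures quoted in the article for the Fermi chip.
TOTAL_CORES = 512
CORES_PER_SM = 32        # cores bundled into one streaming multiprocessor
L1_PER_SM_KB = 64        # shared L1 cache per streaming multiprocessor
L2_SHARED_KB = 768       # L2 cache shared by all cores

# Assumption: the comparison point is the 240-core C1060 cited later in the article.
PRIOR_TESLA_CORES = 240


def fermi_layout():
    """Derive the streaming multiprocessor count and aggregate cache sizes."""
    sm_count = TOTAL_CORES // CORES_PER_SM
    return {
        "sm_count": sm_count,
        "total_l1_kb": sm_count * L1_PER_SM_KB,
        "l2_kb": L2_SHARED_KB,
        "core_ratio_vs_prior": round(TOTAL_CORES / PRIOR_TESLA_CORES, 2),
    }


if __name__ == "__main__":
    for key, value in fermi_layout().items():
        print(f"{key}: {value}")
```

On those figures, the chip works out to 16 streaming multiprocessors with 1 MB of L1 cache in aggregate, and about 2.13 times the cores of the 240-core part.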
The Fermi chip can, in theory, address up to 1 TB of memory, but the Tesla C2050 GPU co-processor has 3 GB of GDDR5 memory and double precision floating point performance of 520 gigaflops; it has a list price of $2,499. The Tesla C2070 GPU has 6 GB of GDDR5 memory and is rated at 630 gigaflops; it costs $3,999. The bang for the buck is best with the smaller unit, which weighs in at $4.81 per gigaflops compared to the $6.35 per gigaflops of the faster GPU.
The Nvidia Tesla 20 series appliances cram four GPUs into a 1U form factor, with four links out to server nodes. The S2050 uses the slower C2050 GPUs and is rated at 2.08 teraflops and will cost $12,995 when it ships. That works out to $6.25 per gigaflops, so you are paying an extra $2,999 for the server that wraps around four Tesla 20 GPUs.
A 1U appliance with four of the faster C2070 GPUs delivers 2.52 teraflops of double-precision floating point performance and costs $18,995, or $7.54 per gigaflops. By comparison, a prior-generation C1060 GPU with 240 cores and delivering only 78 gigaflops at double precision cost $1,699 when it started shipping in June, or about $22 per gigaflops. (No one really cares about single precision and I am ignoring it.)
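For anyone who wants to check the price-performance math, here is a short Python sketch that recomputes dollars per gigaflops from the list prices and double-precision ratings quoted above. The dictionary labels and function names are made up for illustration; only the prices and gigaflops figures come from the article.

```python
# (list price in US dollars, double-precision gigaflops), as quoted in the article.
PRODUCTS = {
    "C2050": (2499, 520),
    "C2070": (3999, 630),
    "S2050 appliance (4x C2050)": (12995, 2080),
    "appliance with 4x C2070": (18995, 2520),
    "C1060 (prior generation)": (1699, 78),
}


def dollars_per_gigaflops(price, gigaflops):
    """List price divided by double-precision gigaflops, rounded to cents."""
    return round(price / gigaflops, 2)


def appliance_premium(appliance_price, card_price, cards=4):
    """What the 1U chassis costs over and above the GPU cards inside it."""
    return appliance_price - cards * card_price


if __name__ == "__main__":
    for name, (price, gflops) in PRODUCTS.items():
        print(f"{name}: ${dollars_per_gigaflops(price, gflops):.2f} per gigaflops")
    print("S2050 premium over four C2050 cards: $",
          appliance_premium(12995, 2499))
```

The same `appliance_premium` call with the C2070 prices shows what the faster appliance adds on top of its four cards.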
The Tesla 20 GPU co-processors and the appliances based on them will be available in the second quarter of 2010, says Keane. The GeForce graphics cards based on the same GPU chips will start rolling out in the first quarter.
GPU for you, sir
There's some other secret sauce in the Fermi GPUs that is going to get HPC nerds thinking about using GPUs.
For one thing, they will support Nvidia's C++ compiler, not just C. Keane dodged exactly when C++ would be ready, and laughed at the idea that Intel's own C++ or Fortran compilers would be ported to CUDA. The Portland Group said this week at SC09 that it has tweaked its popular Fortran compiler to work within the CUDA parallel programming environment that Nvidia created for its graphics cards and co-processors; this Fortran has been in beta testing for about three months.
There are projects that have pulled the CUDA libraries into the popular Matlab tool as well as the R and Python programming languages, and Java applications have been able to bind into the CUDA environment for about a year. The CUDA tool already snaps into the open source Eclipse and Microsoft Visual Studio development tools.
Another bit of secret sauce, also revealed at SC09, is a set of new InfiniBand and Tesla drivers that InfiniBand chip maker Mellanox and Nvidia have cooked up to streamline the movement of data from the InfiniBand ports, to the CPU's main memory, and then down through the PCI-Express bus to the GPU card.
According to Keane, the way it works now, data comes in over InfiniBand, works its way into main memory and is copied; before it is moved down to the GPU, it is copied again and that copy is what is moved. The driver changes allow for the data moved into memory to be moved down to the GPU in one fell swoop, and on early tests on clusters that use GPUs and InfiniBand together, Nvidia and Mellanox have been able to demonstrate a 30 per cent speedup for applications.
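To make the difference concrete, here is a toy Python model of the two paths, based on one literal reading of Keane's description. It is not Nvidia or Mellanox driver code: the function names and the use of byte buffers to stand in for main memory and GPU memory are purely hypothetical, and the model only counts extra copies made in host memory.

```python
def staged_path(packet: bytes):
    """Current behaviour as described: two extra copies in main memory."""
    host_copies = 0
    landed = bytearray(packet)       # data arrives in main memory over InfiniBand
    first_copy = bytearray(landed)   # "...and is copied"
    host_copies += 1
    staging = bytearray(first_copy)  # "...it is copied again"
    host_copies += 1
    gpu_memory = bytes(staging)      # that copy is what moves over PCI-Express
    return gpu_memory, host_copies


def direct_path(packet: bytes):
    """With the new drivers: the landed data is moved to the GPU as-is."""
    host_copies = 0
    landed = bytearray(packet)       # data arrives in main memory over InfiniBand
    gpu_memory = bytes(landed)       # moved down over PCI-Express in one step
    return gpu_memory, host_copies


if __name__ == "__main__":
    payload = b"simulation timestep data"
    for path in (staged_path, direct_path):
        data, copies = path(payload)
        print(f"{path.__name__}: {copies} extra host copies, intact={data == payload}")
```

Both paths deliver the same bytes to the GPU; the only thing that changes in this model is how many times the data is duplicated in main memory along the way.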
Ideally, said Keane, you want the data to move directly from InfiniBand to the PCI-Express bus and on out to the GPU memory, where the data processing actually takes place. The CPU is relegated to a traffic cop, and only gets data in its memory when the application requires it to do some processing. This capability is not available yet, and Keane didn't say when to expect it, either.
Finally, Nvidia this week released the beta of the CUDA Toolkit 3.0, which exploits the Fermi GPU's features. ®