Nvidia has staked a large part of its future on the idea that GPUs and their massively parallel architectures can replace CPUs for a big chunk of computational jobs. But parallel programming on one device is tough, across two incompatible devices is very difficult, and across clusters of hybrid machines can be very tricky indeed. That's why Nvidia's CUDA parallel programming environment is probably as important as any chip or Tesla GPU co-processor that Nvidia will ever ship.
Nvidia figured out even before the first Tesla GPU co-processors came to market a few years back that programming tools were going to be the lubricant that got the Teslas moving for HPC and commercial applications. And to its credit, Nvidia has put a lot of work into making hybrid and parallel programming easier than it has been historically. Luckily, some of the techniques that Nvidia has come up with to parallelize C and C++ code across the many cores in a GPU can also be applied to the increasingly cored and threaded processors on the market; ditto for the work that The Portland Group has done to get its Fortran compilers to spread their work over GPU and CPU cores.
But the GPU nirvana of transparent and automatic parallelization and optimization of codes running across ceepie-geepie gear is not yet in sight, although Nvidia is taking a few steps closer to it with the CUDA 4.0 release. The software has three key new features: GPUDirect 2.0, unified virtual addressing, and support for the Thrust C++ parallel libraries.
One of the big problems with using CPU-GPU hybrids is that the CPU is in control of everything on the system, while the GPU co-processors just hang off the PCI-Express buses, waiting for data to chew on and spit back out to the bus. With GPUDirect 1.0, Nvidia worked with InfiniBand networking adapter makers Mellanox and QLogic to let GPU data be copied out to system main memory where the InfiniBand adapter could get at it directly. So when one GPU in one server needed data from another server in the cluster, it could go out over the PCI-Express bus and up through the chipset and CPU's memory controller to access that data right there, rather than going the extra steps of sending a request to the GPU on that second machine and waiting for the answer to come back through the CPU stack again.
This simple change boosted network communication performance by around 30 per cent, according to Nvidia.
With GPUDirect 2.0, which is embedded in the CUDA 4.0 toolkit, a GPU co-processor in the system has a new driver stack that allows it to talk directly to another GPU on the same system over the PCI-Express bus, getting the system chipset, CPU memory controller, and system main memory out of the loop entirely.
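In code, the peer-to-peer path looks something like the following sketch, using the peer-access calls in the CUDA 4.0 runtime API (device numbers and buffer sizes here are made up for illustration):

```cuda
#include <cuda_runtime.h>

int main(void) {
    int canAccess = 0;
    // Ask the runtime whether GPU 0 can read GPU 1's memory
    // directly over the PCI-Express bus.
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);

    if (canAccess) {
        float *src, *dst;
        size_t bytes = 1 << 20;

        cudaSetDevice(1);
        cudaMalloc(&src, bytes);           // buffer living on GPU 1

        cudaSetDevice(0);
        cudaMalloc(&dst, bytes);           // buffer living on GPU 0
        cudaDeviceEnablePeerAccess(1, 0);  // let GPU 0 see GPU 1's memory

        // GPU-to-GPU copy with no staging through system main memory.
        cudaMemcpyPeer(dst, 0, src, 1, bytes);

        cudaFree(dst);
        cudaSetDevice(1);
        cudaFree(src);
    }
    return 0;
}
```

Once peer access is enabled, a kernel running on one GPU can even dereference pointers into the other GPU's memory, with the hardware doing the bus transfers behind the scenes.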
In a future CUDA release, says Sumit Gupta, senior product manager of the Tesla line, the GPUDirect software will be tweaked so GPUs on different servers within a cluster of machines can directly access information from each other over InfiniBand links without getting the CPUs in the act at all with the copies of data into system main memory. So there will be peer-to-peer communication between GPUs over the PCI-Express bus within a system as well as between GPUs linked to each other over InfiniBand links that lash together multiple servers. (Which once again raises the question: what will you need the CPU for? Oh, right, the operating system that holds the C, C++, or Fortran code.)
The Message Passing Interface (MPI) protocol commonly used for clustering x64-based servers together into parallel machines is not able to use the GPUDirect 2.0 functionality yet, according to Gupta, but in the CUDA 4.0 release functionality similar to GPUDirect 1.0 allows for data inside of GPUs to be moved to system memory and be available for MPI collective operations. Modified versions of MPI, such as OpenMPI, can move data from and to the GPU memory over InfiniBand when applications do an MPI send or receive operation.
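The difference between stock MPI and a CUDA-aware build is visible at the call site. The sketch below is illustrative only; the helper names are made up, and the second variant assumes an MPI library (such as a suitably built OpenMPI) that accepts device pointers:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Stock MPI: stage GPU data through a pinned host buffer before sending.
void send_from_gpu(const float *d_buf, int n, int dest) {
    float *h_buf = NULL;
    cudaMallocHost(&h_buf, n * sizeof(float));  // pinned staging buffer
    cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
    MPI_Send(h_buf, n, MPI_FLOAT, dest, 0, MPI_COMM_WORLD);
    cudaFreeHost(h_buf);
}

// CUDA-aware MPI: hand the device pointer straight to MPI and let the
// library move the data over InfiniBand itself.
void send_from_gpu_direct(const float *d_buf, int n, int dest) {
    MPI_Send((void *)d_buf, n, MPI_FLOAT, dest, 0, MPI_COMM_WORLD);
}
```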
Another neat feature of the CUDA 4.0 environment is called unified virtual addressing, and it takes the memory space of the system and the memory spaces of the multiple GPUs in the machine and maps them as a single unified address space. Developing applications with prior CUDA toolkits required programmers to maintain pointers to CPU and GPU memories in their code, but now they won't have to do that. CUDA will keep track of what data is stored where. According to Gupta, programs that were written with these pointers will continue to work, but coders working to port applications to CPU-GPU hybrids will now have less work to do. Gupta says that the use of the unified memory scheme will not have an adverse effect on performance.
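With unified virtual addressing, the runtime can work out from the pointer value alone which memory space it lives in, which is what the `cudaMemcpyDefault` direction flag in CUDA 4.0 exploits. A minimal sketch:

```cuda
#include <cuda_runtime.h>

int main(void) {
    size_t bytes = 1 << 20;
    float *h_buf, *d_buf;

    cudaMallocHost(&h_buf, bytes);   // pinned host allocation
    cudaMalloc(&d_buf, bytes);       // device allocation

    // Under unified virtual addressing, cudaMemcpyDefault tells the
    // runtime to infer from the pointers themselves which side is host
    // and which is device -- the programmer no longer has to track the
    // direction of every transfer.
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyDefault);
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDefault);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```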
"This unified virtual addressing is a step towards Denver, which will have a single address space anyway," explains Gupta, referring to the hybrid ARM CPU-Maxwell GPU chip that Nvidia said it was working on back in January.
The Maxwell GPUs are expected to offer 16 times the gigaflops per watt of the current Fermi GPUs when they are delivered in 2013, and some of them will have one or more ARM processors on them so they can be used in servers and PCs. (My guess is Nvidia will do a quad-core ARM chip, but the company has provided few details on its CPU plans.)
The other big feature in CUDA 4.0 is the Thrust C++ library, which is an open source project that Nvidia has been contributing to. The Thrust library is similar to the Standard Template Library (STL) for C++, except it has been tweaked for parallel algorithms and data structures and, in the case of CUDA, to work on GPUs as well as CPUs. With the enhancements in CUDA 4.0, the toolkit will analyze the code and automatically divvy up work between the CPUs and GPUs to get the fastest code path and the best performance for an algorithm. In many cases, of course, the GPU will give the best oomph, such as with sorting algorithms. Gupta says that the Thrust library for C++ running on a GPU can do parallel sorting anywhere from 5 to 100 times faster than the C++ STL library.
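The appeal of Thrust is that GPU code ends up looking like ordinary STL code. The canonical example is a parallel sort on the device, sketched here (vector size is arbitrary):

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main(void) {
    // Generate unsorted data on the host.
    thrust::host_vector<int> h_vec(1 << 20);
    for (size_t i = 0; i < h_vec.size(); ++i)
        h_vec[i] = rand();

    // Copying into a device_vector moves the data to the GPU;
    // the sort call then runs in parallel across its cores, but
    // reads just like std::sort on an STL container.
    thrust::device_vector<int> d_vec = h_vec;
    thrust::sort(d_vec.begin(), d_vec.end());

    // Bring the sorted result back to host memory.
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
    return 0;
}
```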
CUDA 4.0 has some improvements in its threading model, too. Now a single CPU thread in a system can access all of the GPUs in the system at the same time to dispatch work, and conversely, multiple CPU threads inside of an application can share contexts on a single GPU at the same time rather than having to wait their turn. The updated toolkit has a new GPU binary disassembler, adds support for the cuda-gdb debugger on MacOS clients, and brings better C++ debugging, including support for new/delete and virtual functions.
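The first of those threading changes means one host thread can walk every GPU in the box and queue work on each; a rough sketch (the kernel here is a hypothetical stand-in):

```cuda
#include <cuda_runtime.h>

// Stand-in for any real kernel you would launch per device.
__global__ void my_kernel(float *data) { /* ... */ }

int main(void) {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);

    // Before CUDA 4.0, driving N GPUs generally meant N host threads,
    // one context apiece. Now a single host thread can switch devices
    // and dispatch work to all of them.
    for (int dev = 0; dev < ngpus; ++dev) {
        cudaSetDevice(dev);
        float *d_buf;
        cudaMalloc(&d_buf, 1 << 20);
        my_kernel<<<128, 256>>>(d_buf);  // launches are asynchronous
    }

    // Synchronize with the current (last-selected) device; in real
    // code you would loop over devices to wait on each in turn.
    cudaDeviceSynchronize();
    return 0;
}
```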
The first release candidate for the CUDA 4.0 toolkit is being announced today, and will be available for download for free on March 4. You have to be a registered developer to get your hands on it. Gupta says that Nvidia expects it will take six to eight weeks to shake whatever bugs developers find out of the release candidate, and then it will become generally available. Nvidia has a large pool of programmers to pull from to help it harden the CUDA 4.0 code. Through the end of 2010, the company had more than 700,000 cumulative downloads of CUDA tools and estimates that this represents around 100,000 active developers. ®