This article is more than 1 year old
DOE doles out cash to AMD, Whamcloud for exascale research
Pushing compute, memory, and I/O to the limits
The US Department of Energy used its massive budget to push supercomputers to gigaflops, teraflops, and petaflops in the prior three decades and it is being tasked to put the pedal to the exaflops metal before the end of this decade.
To get there, the DOE has to fund primary research at IT vendors who might otherwise not get around to it until it suited their own commercial needs. It has to also foster collaboration across vendors who might otherwise rather not share ideas, because no one vendor is going to be able to solve the exascale problem by itself.
The main vehicle for funding exascale computing is called the Extreme-Scale Computing Research and Development program, which is being funded by both halves of the DOE. That would be the Office of Science, which funds scientific research in the nuke labs, and the National Nuclear Security Administration, which runs simulations to make sure the US military's nuclear warheads work since Uncle Sam can't set one off thanks to the Nuclear Test Ban Treaty. There is talk that the supers at the DOE labs aren't just making sure existing nukes work, but also helping to redesign them.
The first phase of the DOE's exascale system funding is called FastForward, which is being administered by Lawrence Livermore National Laboratory in conjunction with the six other primary DOE nuke labs (some of which dislike being called nuke labs even though they do nuclear physics research).
Those other DOE labs, along with LLNL, are the name brands in high performance computing in the United States: Argonne National Laboratory, Lawrence Berkeley National Laboratory, Los Alamos National Laboratory, Oak Ridge National Laboratory, Pacific Northwest National Laboratory, and Sandia National Laboratories.
The FastForward exascale research program issued its request for proposals on March 29, and asked that they be submitted by May 11. The program seeks to fund basic research in exascale computing as it relates to three areas: Memory, processors, and storage and I/O.
It has an explicit goal of trying to solicit cooperation across multiple companies, much like the US Defense Advanced Research Project Agency's Ubiquitous High Performance Computing program. In a way, the UHPC program at DARPA is the trailblazer for the FastForward program at DOE.
DARPA always first to fight
The UHPC program was announced in March 2010 with the goal of creating an HPC system that by 2018 can do 50 gigaflops per watt (BlueGene/Q, the current top performer and most efficient super in the world, can do a little more than 2 gigaflops per watt) and pack 10 petabytes of storage and do around 3 petaflops of number crunching into a slight larger server rack than is standard and within a 57 kilowatt power budget.
Building an exascale system would seem easier, by comparison, since there is, in theory, no limit on the size of the machine or its power budget. But in reality, there are big-time power limits on exascale supers because no one is going to build a 20 megawatt nuclear or coal power station to keep one fed and cooled.
In August 2010, two teams were awarded UHPC ExtremeScale contracts with a total of $74m: one lead by Nvidia and the other Intel. Nvidia got a $25m grant and has teamed up with Cray, Oak Ridge National Lab, and six universities. Intel teamed up with three universities, SGI, Lockheed Martin, Cray, Reservoir Labs, and ET International to take down a $49m grant.
In three related UHPC grants, Sandia National Lab has teamed up LexusNexus and two universities, MIT has its own grant, and so does Georgia Tech, apparently. Total funding for the UHPC effort is said to be on the order of $100m, but DARPA has never confirmed that figure.
Three steps to DOE-sponsored exascale computing
With the FastForward program, the DOE is setting a cap of $20m on any proposals to try to encourage focused work on specific problems, and said at the get-go that what it was looking for was more like two $10m proposals in each of the three areas of primary research.
It is not clear how many awards have been made yet – the vendors are not notified of who was bidding and who won, but rather that they won. At the moment, AMD has been awarded a FastForward contract for processor and memory research and Whamcloud has one contract for storage and I/O research. There could be – and probably will be – others getting grants. Uncle Sam likes to hedge its HPC bets.
Once the primary research on possible exascale technologies is completed over the next two years, DOE will be looking at funding vendors to put together prototypes – this is tentatively called the system design phase – and then, by 2020, to build full exascale systems based on those prototypes – known as the system build phase at the moment. DOE will no doubt come up with other names later on.
According to the statement of work (PDF) for the FastForward contract, the issues that vendors face on the exascale challenge are daunting.
On a current petaflops-class system today, it costs somewhere between $5m and $10m to power and cool the machine today, and extrapolating to an exascale machine using current technology, even with efficiency improvements, you would be in for $2.5bn a year just to power an exascale beast and you would need something on the order of 1,000 megawatts to power it up. That's 50 nuclear reactors, more or less. The DOE has set a target of a top juice consumption at 20 megawatts for an exascale system.
Using DDR3 main memory today, a 2 petaflop machine with 2PB of main memory burns about 1.25 megawatts, and assuming that we can get to DDR5 main memory by 2020, we're talking about needing 260 megawatts just for the memory subsystem in an exascale box. Even if you cut the memory-to-flops ratio by a factor of five, which many people don't think is a good idea, and you are above 50 megawatts just for the memory subsystems across a cluster.
In addition to power consumption, memory components are not getting as cheap as CPU components, and memory bandwidth is not keeping up with the ever-increasing core count on processors and thus memory latencies are increasing.
There are resiliency issues with all of the components in an exascale system, which will have large numbers of components frying all the time. And then you are going to have billions of compute elements, and there has to be a hierarchy of memory and interconnects to keep them all fed and communicating with each other as simulations run.
Worse still, programming these petaflops machines is a complete bitch, and an exaflops system will be in the range of old battle-axe mother-in-law. Beyond that, you are programming against Death.
On the processor front, during the FastForward phase, the DOE is looking to better measure and control the power use in processors and integration with memory, network, and optics from the CPU or hybrid CPU-GPU chip, as the case may be. On its wish list, the DOE wants automatic rollback after faults or synchronization errors and better fault detection and correction.
Boosting the movement of data onto and off of the chip is also key, as is handling collective operations across compute elements, and software-controlled placement of data on the chip and its memory hierarchy is also penciled in. Putting network interfaces on the processor is a requirement, and so is boosting the concurrency across many cores and many threads on the cores.
The compute elements of the FastForward potion of the project have to provide 50 gigaflops per watt at scale – that's the same level of performance per watt that DARPA is looking for its ExtremeScale UHPC project. The system has to have a mean time between application failure of six days or larger.
This doesn't sound so great until you realize the system will have trillions of components and that today, with petaflops-class machines, it is on the order of one to five days and, without check-pointing or other resilience mechanisms will drop to about six hours by 2020.
DOE would like to have compute nodes with more than 10 teraflops of double-precision number-crunching performance, 4TB/sec of aggregate memory bandwidth and more than 100GB of main memory; something on the order of 32GB to 640GB is preferred. Total bandwidth between a node and the interconnect that lashes them together should be in excess of 400GB/sec.