DOE doles out cash to AMD, Whamcloud for exascale research

Pushing compute, memory, and I/O to the limits

The US Department of Energy used its massive budget to push supercomputers to gigaflops, teraflops, and petaflops in the prior three decades and it is being tasked to put the pedal to the exaflops metal before the end of this decade.

To get there, the DOE has to fund primary research at IT vendors who might otherwise not get around to it until it suited their own commercial needs. It also has to foster collaboration across vendors who would rather not share ideas, because no one vendor is going to be able to solve the exascale problem by itself.

The main vehicle for funding exascale computing is called the Extreme-Scale Computing Research and Development program, which is being funded by both halves of the DOE. That would be the Office of Science, which funds scientific research in the nuke labs, and the National Nuclear Security Administration, which runs simulations to make sure the US military's nuclear warheads work since Uncle Sam can't set one off thanks to the Nuclear Test Ban Treaty. There is talk that the supers at the DOE labs aren't just making sure existing nukes work, but also helping to redesign them.

The first phase of the DOE's exascale system funding is called FastForward, which is being administered by Lawrence Livermore National Laboratory in conjunction with the six other primary DOE nuke labs (some of which dislike being called nuke labs even though they do nuclear physics research).

Those other DOE labs, along with LLNL, are the name brands in high performance computing in the United States: Argonne National Laboratory, Lawrence Berkeley National Laboratory, Los Alamos National Laboratory, Oak Ridge National Laboratory, Pacific Northwest National Laboratory, and Sandia National Laboratories.

The FastForward exascale research program issued its request for proposals on March 29, and asked that they be submitted by May 11. The program seeks to fund basic research in exascale computing as it relates to three areas: memory, processors, and storage and I/O.

It has an explicit goal of trying to solicit cooperation across multiple companies, much like the US Defense Advanced Research Projects Agency's Ubiquitous High Performance Computing program. In a way, the UHPC program at DARPA is the trailblazer for the FastForward program at DOE.

DARPA always first to fight

The UHPC program was announced in March 2010 with the goal of creating an HPC system that by 2018 can do 50 gigaflops per watt (BlueGene/Q, the current top performer and most efficient super in the world, can do a little more than 2 gigaflops per watt). The system is also meant to pack 10 petabytes of storage and around 3 petaflops of number crunching into a rack slightly larger than a standard server rack, all within a 57 kilowatt power budget.
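The UHPC numbers hang together, as a quick back-of-the-envelope check shows. The sketch below only multiplies the efficiency target by the rack power budget quoted above; the function name is ours, not DARPA's, and the result is a peak figure that ignores storage and cooling overheads.

```python
def rack_petaflops(gflops_per_watt, kilowatts):
    """Peak petaflops a rack could deliver at a given efficiency and power budget."""
    flops = gflops_per_watt * 1e9 * kilowatts * 1e3
    return flops / 1e15

# UHPC targets quoted in the article: 50 gigaflops per watt, 57 kilowatts.
uhpc_rack = rack_petaflops(50, 57)
print(f"UHPC rack at target efficiency: {uhpc_rack:.2f} petaflops")
```

That works out to just under 3 petaflops, which matches the "around 3 petaflops" goal.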

Building an exascale system would seem easier, by comparison, since there is, in theory, no limit on the size of the machine or its power budget. But in reality, there are big-time power limits on exascale supers because no one is going to build a 20 megawatt nuclear or coal power station to keep one fed and cooled.

In August 2010, two teams were awarded UHPC ExtremeScale contracts with a total of $74m: one led by Nvidia and the other by Intel. Nvidia got a $25m grant and has teamed up with Cray, Oak Ridge National Lab, and six universities. Intel teamed up with three universities, SGI, Lockheed Martin, Cray, Reservoir Labs, and ET International to take down a $49m grant.

In three related UHPC grants, Sandia National Lab has teamed up with LexisNexis and two universities, MIT has its own grant, and so does Georgia Tech, apparently. Total funding for the UHPC effort is said to be on the order of $100m, but DARPA has never confirmed that figure.

Three steps to DOE-sponsored exascale computing

With the FastForward program, the DOE is setting a cap of $20m on any proposals to try to encourage focused work on specific problems, and said at the get-go that what it was looking for was more like two $10m proposals in each of the three areas of primary research.

It is not clear how many awards have been made yet – vendors are told only that they have won, not who else was bidding or who else won. At the moment, AMD has been awarded a FastForward contract for processor and memory research and Whamcloud has one contract for storage and I/O research. There could be – and probably will be – others getting grants. Uncle Sam likes to hedge its HPC bets.

Once the primary research on possible exascale technologies is completed over the next two years, DOE will be looking at funding vendors to put together prototypes – this is tentatively called the system design phase – and then, by 2020, to build full exascale systems based on those prototypes – known as the system build phase at the moment. DOE will no doubt come up with other names later on.

According to the statement of work (PDF) for the FastForward contract, the issues that vendors face on the exascale challenge are daunting.

On a current petaflops-class system, it costs somewhere between $5m and $10m a year to power and cool the machine. Extrapolating to an exascale machine using current technology, even with efficiency improvements, you would be in for $2.5bn a year just to power an exascale beast and you would need something on the order of 1,000 megawatts to power it up. That's roughly the output of a large nuclear reactor. The DOE has set a target of a top juice consumption at 20 megawatts for an exascale system.
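To see what the 20 megawatt ceiling implies, divide the target performance by the power budget. This is a minimal sketch using only figures quoted in this article; the BlueGene/Q efficiency is rounded down to 2 gigaflops per watt, and the helper name is our own.

```python
EXAFLOPS = 1e18  # floating point operations per second

def required_gflops_per_watt(flops, megawatts):
    """Efficiency needed to deliver `flops` inside a `megawatts` power budget."""
    return flops / (megawatts * 1e6) / 1e9

target = required_gflops_per_watt(EXAFLOPS, 20)
bluegene_q = 2.0  # approximate gigaflops per watt, as quoted in the article

print(f"Needed: {target:.0f} gigaflops per watt")
print(f"Improvement over BlueGene/Q: about {target / bluegene_q:.0f}x")
```

An exaflops in 20 megawatts needs 50 gigaflops per watt, roughly 25 times what the most efficient machine manages today.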

Using DDR3 main memory today, a 2 petaflop machine with 2PB of main memory burns about 1.25 megawatts, and assuming that we can get to DDR5 main memory by 2020, we're talking about needing 260 megawatts just for the memory subsystem in an exascale box. Even if you cut the memory-to-flops ratio by a factor of five, which many people don't think is a good idea, you are still above 50 megawatts just for the memory subsystems across a cluster.
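The memory arithmetic can be laid out in a few lines. The sketch below uses the figures quoted above and makes one loud assumption of its own: that memory power scales linearly with flops if the memory-to-flops ratio is held constant. The implied DDR5 gain it derives is our inference, not a number from the statement of work.

```python
DDR3_MEMORY_MW = 1.25          # memory power of today's 2 petaflops machine
DDR3_MACHINE_PETAFLOPS = 2
EXASCALE_PETAFLOPS = 1000
DDR5_EXASCALE_MEMORY_MW = 260  # quoted estimate for an exascale box
POWER_TARGET_MW = 20           # DOE ceiling for the whole system

# Assumption: linear scaling of memory power with flops at a fixed ratio.
naive_ddr3_mw = DDR3_MEMORY_MW * EXASCALE_PETAFLOPS / DDR3_MACHINE_PETAFLOPS
implied_ddr5_gain = naive_ddr3_mw / DDR5_EXASCALE_MEMORY_MW

# Cutting the memory-to-flops ratio by a factor of five.
reduced_memory_mw = DDR5_EXASCALE_MEMORY_MW / 5
times_over_budget = reduced_memory_mw / POWER_TARGET_MW

print(f"DDR3 scaled to exascale: {naive_ddr3_mw:.0f} MW")
print(f"Implied DDR5 efficiency gain: {implied_ddr5_gain:.1f}x")
print(f"Memory at one-fifth ratio: {reduced_memory_mw:.0f} MW, "
      f"{times_over_budget:.1f}x the whole-system target")
```

Even the slimmed-down memory subsystem alone would draw more than two and a half times the 20 megawatt budget for the entire machine.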

In addition to power consumption, memory components are not getting as cheap as CPU components, and memory bandwidth is not keeping up with the ever-increasing core count on processors and thus memory latencies are increasing.

There are resiliency issues with all of the components in an exascale system, which will have large numbers of components frying all the time. And then you are going to have billions of compute elements, and there has to be a hierarchy of memory and interconnects to keep them all fed and communicating with each other as simulations run.

Worse still, programming these petaflops machines is a complete bitch, and an exaflops system will be in the range of old battle-axe mother-in-law. Beyond that, you are programming against Death.

On the processor front, during the FastForward phase, the DOE is looking to better measure and control the power use in processors and integration with memory, network, and optics from the CPU or hybrid CPU-GPU chip, as the case may be. On its wish list, the DOE wants automatic rollback after faults or synchronization errors and better fault detection and correction.

Boosting the movement of data onto and off of the chip is also key, as is handling collective operations across compute elements, and software-controlled placement of data on the chip and its memory hierarchy is also penciled in. Putting network interfaces on the processor is a requirement, and so is boosting the concurrency across many cores and many threads on the cores.

The compute elements of the FastForward portion of the project have to provide 50 gigaflops per watt at scale – that's the same level of performance per watt that DARPA is looking for in its ExtremeScale UHPC project. The system has to have a mean time between application failures of six days or longer.

This doesn't sound so great until you realize the system will have trillions of components and that today, with petaflops-class machines, the mean time between failures is on the order of one to five days and, without check-pointing or other resilience mechanisms, will drop to about six hours by 2020.

DOE would like to have compute nodes with more than 10 teraflops of double-precision number-crunching performance, 4TB/sec of aggregate memory bandwidth and more than 100GB of main memory; something on the order of 32GB to 640GB is preferred. Total bandwidth between a node and the interconnect that lashes them together should be in excess of 400GB/sec.
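Those per-node wish-list numbers translate into system-level figures with a little division. The sketch below takes the minimums quoted above at face value (10 teraflops, 4TB/sec, 100GB, 400GB/sec per node), so the node count is an upper bound. It also assumes, purely for illustration, that the whole 20 megawatt budget is spread evenly across the nodes, which the DOE does not state.

```python
EXAFLOPS = 1e18
NODE_FLOPS = 10e12          # 10 teraflops double precision per node
NODE_MEMORY_BW = 4e12       # 4TB/sec aggregate memory bandwidth
NODE_MEMORY = 100e9         # 100GB main memory (decimal gigabytes)
NODE_NETWORK_BW = 400e9     # 400GB/sec between node and interconnect
POWER_TARGET_W = 20e6       # 20 megawatt system ceiling

nodes = EXAFLOPS / NODE_FLOPS
total_memory_pb = nodes * NODE_MEMORY / 1e15
memory_bytes_per_flop = NODE_MEMORY_BW / NODE_FLOPS
network_bytes_per_flop = NODE_NETWORK_BW / NODE_FLOPS
watts_per_node = POWER_TARGET_W / nodes  # assumes all power goes to nodes

print(f"Nodes for an exaflops: {nodes:,.0f}")
print(f"Total main memory: {total_memory_pb:.0f} PB")
print(f"Memory bandwidth: {memory_bytes_per_flop:.2f} bytes per flop")
print(f"Network bandwidth: {network_bytes_per_flop:.2f} bytes per flop")
print(f"Power budget per node: {watts_per_node:.0f} W")
```

At those minimums an exaflops machine would have about 100,000 nodes, each living on something like 200 watts, with 0.4 bytes of memory bandwidth for every flop.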
