Because Advanced Micro Devices has not yet announced its 16-core "Interlagos" Opteron 6200 processors, it has to talk about something, and in situations like that, it is best to talk about the far-off future. And so AMD rounded up a bunch of its partners on Wednesday in San Francisco for a shindig to talk about the challenges of exascale computing.
Chuck Moore, CTO in the chip maker's Technology Group, did the talking about exascale, or the desire to create machines that can deliver more than 1,000 petaflops of number-crunching performance. Moore was one of the lead architects of the "Bulldozer" core used in the forthcoming Opteron processors, as well as for the Fusion hybrid CPU-GPU chips, which AMD calls accelerated processing units, or APUs for short.
While Moore is hot on GPUs, he said this is not something new, so much as a return to the past with a twist. "GPU computing is still in its infancy," Moore explained. "Instead of thinking of computing on a GPU, you should be thinking of this as a revival of vector computing. Going forward, we will be developing GPUs that look more like vector computers."
That got a big hallelujah from Peg Williams, senior vice president of HPC Systems at supercomputer Cray, a descendent of one of the companies founded by Seymour Cray - the legendary supercomputer maker from Control Data (and the company that bears his name) and a man who forgot more about vector processors than most experts will ever know.
The issue, said Moore, is not getting to exascale performance, but getting to exascale performance within a 20 megawatt power budget by 2020 or so.
When Moore and his colleagues were thinking about the design of the Bulldozer core, they did some math and figures that to get somewhere between 1 petaflops to 10 petaflops of performance would eat up around 10 megawatts of power, depending on the system architecture, interconnect, and the scalability of the application software running on the massive cluster.
At the midpoint of the performance range, you are talking about needing 200 megawatts to power up an exascale machine. At $1 per watt per year, which is what these supercomputer labs cost, you are talking about needing $200m a year just to turn the machine on and cool it. So clearly, scaling up high-end x86 CPUs – or any big fat RISC chip – is not the answer.
That's why Oak Ridge National Lab's "Titan" machine is a mix of CPUs and Nvidia GPUs embodied in a Cray XK6 chassis. That machine is expected to scale from 10 to 20 petaflops. At 20 petaflops, about 85 per cent of the oomph in the Titan machine will be coming from the GPUs. CPUs, which handle all the serial work to keep the GPUs fed with numbers to crunch.
The AMD plan, says Moore, is to get a 10 teraflops Fusion APU into the field that only consumes 150 watts, and to use this as the basis of an exascale machine. "You start to think that maybe we can get there," said Moore, saying that he would put a stake in the ground and predict an exascale system could be built by 2019 or 2020.
The issue is not the CPU or the GPU, but rather memory bandwidth between the two devices and between the main memory they will share. This will involve stacking memory in 3D configurations on a chip package with these CPUs and GPUs.
"That is technology that doesn't exist today, but it will be here in time," Moore predicted.
The other trick will be to have a single memory address space for the GPUs and CPUs, but perhaps using different memory technologies to create different segments of main memory that would be more suited to CPUs or GPUs, and let the system try to use those blocks whenever possible.
This idea, like all others, is not new, of course. There are probably many examples, but the one that comes to my mind is the single-level storage architecture of the System/38, AS/400 and Power Systems machines from IBM. They treated cache, main, and disk storage as a single addressable space, meaning that programmers didn’t have to worry about pointers and moving data from disk to memory and back.
It was done automagically by the operating system so RPG programmers could focus on the business logic in their programs instead of worrying about data management. This is precisely the goal that everyone has for future supercomputer applications that span multiple computing architectures.
The use of Fusion APUs in supercomputers got its start today. Penguin Computing, an AMD reseller, announced that it has sold a 59.6 teraflops system to Sandia National Labs, one of the big US Department of Energy compute facilities. The 104-node system is based on AMD's A8-3850 APU and is plunked into the Altus 2A00 rack server. And yes, it can play Crysis. ®