Putting a petaflops into a single server rack isn't as difficult as the Defense Advanced Research Projects Agency had thought.
Back in March, DARPA — the research arm of the US military that brought you the Internet — put out a call to all nerds in the Ubiquitous High Performance Computing program to come up with an ExtremeScale supercomputer that packs one petaflops of number-crunching oomph into a single server rack. The ExtremeScale system also needed to run off a portable electric generator, consuming no more than 57 kilowatts, and cram compute, storage, networking, and cooling into a rack that is 24 inches wide by 78 inches high and 40 inches deep - a little wider and taller than a standard server rack - and deliver 50 gigaflops per watt on the Linpack benchmark test.
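For a rough sense of how those targets relate, here is a minimal back-of-the-envelope sketch in Python. It uses only the figures quoted above (one petaflops, a 57 kilowatt cap, and 50 gigaflops per watt); the variable names are illustrative and nothing here comes from DARPA's own documents.

```python
# Back-of-the-envelope check of the ExtremeScale targets quoted in the text.
PETAFLOPS = 1e15                # one petaflops, in flops
power_cap_watts = 57_000        # 57 kilowatt power ceiling
target_gflops_per_watt = 50     # stated efficiency goal

# Efficiency needed just to hit one petaflops at the full power cap.
implied_gflops_per_watt = PETAFLOPS / power_cap_watts / 1e9

# Power a one-petaflops machine would draw if it met the efficiency goal.
power_at_target_kw = PETAFLOPS / (target_gflops_per_watt * 1e9) / 1e3

print(f"Efficiency needed at the 57 kW cap: {implied_gflops_per_watt:.1f} gigaflops/watt")
print(f"Power draw at 50 gigaflops/watt: {power_at_target_kw:.0f} kW")
```

On these numbers the efficiency goal is the stricter of the two: a petaflops machine that hits 50 gigaflops per watt would sit well under the 57 kilowatt ceiling.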
DARPA wants a new, easier to use programming model for parallel machines too, and it doesn't expect anyone to deliver an ExtremeScale box until around 2018, by which time there should be exascale-class systems delivering 1,000 petaflops of aggregate performance. If such machines don't burn through the Earth's mantle.
Thanks to graphics co-processors and other kinds of accelerators, cramming one petaflops into a single cabinet is the easy part of what DARPA is chasing.
According to Bill Mannel, vice president of product marketing at SGI, about four months ago, well ahead of DARPA's call for the ExtremeScale future system, SGI's top brass were sitting around in the lab and set a goal to be able to put a petaflops into one cabinet, instead of the hundreds of cabinets it takes today using clusters of x64 servers. And to minimize the amount of engineering work that needs to be done and to open up the architecture (thus staying away from proprietary interconnects), SGI decided that the PCI-Express peripheral bus is the best way to link co-processors into clusters to radically augment the floating point performance of x64 clusters.
This is easy enough to say, but the problem is this: The blade servers designed for HPC and general purpose workloads generally don't have PCI-Express peripheral slots. SGI's "UltraViolet" Altix UV shared memory supers do not have PCI-Express slots, and only a double-wide I/O blade in the Altix ICE clusters has a single PCI-Express slot; the Altix ICE compute blades don't have any. The Altix UV systems do have an external I/O chassis that allows for four GPU co-processors to be put into the system rack. But in no case are the GPUs sitting particularly close to the CPUs that are dispatching work.
This petaflops cabinet, known as Project Mojo, will have compute nodes that have lots of PCI-Express slots to attach GPUs - including both Tesla GPUs from Nvidia and FireStream GPUs from Advanced Micro Devices - for doing floating point math. SGI is also going to Tilera, a maker of multicore systems-on-a-chip, to accelerate integer-based supercomputing workloads, common in the life sciences. As El Reg previously reported, Tilera secured $25m in venture capital back in March from chip maker Broadcom, server and PC maker Quanta, and telecom giant NTT to fund its development and marketing of the 100-core Tile-GX100.
This chip, according to Tilera, will deliver about four times the integer oomph of a four-core Xeon 5500 chip, but burn only 55 watts doing it, thus delivering nearly eight times the integer operations per second per watt of the x64 chip for about 30 per cent less money. Because of this, Quanta is engineering Linux-based server appliances based on the Tilera chips aimed at cloud computing service providers, and NTT is presumably kicking in money because it wants to do more with less in its data centers. (NTT didn't explain its investment.)
SGI is being cagey about what the Project Mojo machine will look like. But Mannel did laugh when El Reg suggested that it would need more than two PCI-Express x16 slots per blade to get the job done, and then confirmed that this is true. Because of the need for PCI-Express slots on the blades of this petaflops box, it is also reasonable to guess that it will not be loaded up with main memory or local disk storage. But let's do a little math here and see if we can suss it out.
The current way hybrid CPU-GPU systems are built is to pair a GPU with each CPU socket, but clearly this future SGI box will have more accelerators than CPUs, with the CPU being little more than a traffic cop for the GPU or Tilera co-processors. The new Tesla M2050 embedded and fanless GPU co-processor from Nvidia, which just started shipping a month ago, is rated at 515 gigaflops of double-precision and 1.03 teraflops of single-precision math. Let's give SGI the benefit of the doubt and only look at single-precision math, which is fine for some workloads. You'd need to put 971 of these Tesla M2050s in the cabinet to reach a petaflops of peak theoretical performance. That seems insane.
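The card count is simple division, sketched here in Python using the M2050 ratings quoted above. The helper name `cards_needed` is made up for illustration, and the double-precision figure is an extra data point worked out from the same ratings rather than a number SGI or Nvidia supplied.

```python
import math

PETAFLOPS = 1e15  # one petaflops of peak theoretical performance, in flops

# Tesla M2050 peak ratings as quoted in the text.
M2050_SINGLE_PRECISION = 1.03e12   # 1.03 teraflops
M2050_DOUBLE_PRECISION = 515e9     # 515 gigaflops


def cards_needed(target_flops: float, flops_per_card: float) -> int:
    """Whole number of cards needed to reach a peak flops target."""
    return math.ceil(target_flops / flops_per_card)


single = cards_needed(PETAFLOPS, M2050_SINGLE_PRECISION)
double = cards_needed(PETAFLOPS, M2050_DOUBLE_PRECISION)

print(f"Single precision: {single} cards")
print(f"Double precision: {double} cards")
```

Count double-precision flops instead and the number of cards roughly doubles, which is why the single-precision figure is the generous reading.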
When supercomputer maker Appro International crammed four Tesla M2050s and a two-socket x64 server into a 1U chassis (the Tetra 1426G4), that was twice as dense as anyone else could do, and that is only 168 GPUs per rack. SGI has to take the density up by a factor of 5.8, cramming 23 GPUs into a 1U space. Assuming you use a 2U chassis and can mount the GPUs vertically in PCI-Express slots, you are talking about having to put 46 GPUs and therefore 46 PCI-Express x16 slots onto a server. This doesn't seem physically possible.
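Here is the same density math as a short Python sketch. It assumes a standard 42U rack filled entirely with GPU chassis, which is what the 168-GPU figure implies, and takes the 971-card count from the paragraph before; the variable names are illustrative.

```python
RACK_UNITS = 42            # assumed standard 42U rack, fully populated
APPRO_GPUS_PER_1U = 4      # Appro's four M2050s per 1U chassis
CARDS_FOR_PETAFLOPS = 971  # M2050 count for one single-precision petaflops

# GPUs in a rack full of Appro-style 1U chassis.
appro_gpus_per_rack = RACK_UNITS * APPRO_GPUS_PER_1U

# How much denser than Appro the petaflops rack would have to be.
density_factor = CARDS_FOR_PETAFLOPS / appro_gpus_per_rack

# GPUs per 1U of rack space, and per 2U chassis at that density.
gpus_per_1u = round(CARDS_FOR_PETAFLOPS / RACK_UNITS)
gpus_per_2u = 2 * gpus_per_1u

print(f"Appro density: {appro_gpus_per_rack} GPUs per rack")
print(f"Required density increase: {density_factor:.1f}x")
print(f"GPUs per 1U: {gpus_per_1u}, per 2U chassis: {gpus_per_2u}")
```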
Maybe SGI is talking about the peak performance of a rack when equipped with a future ATI FireStream GPU co-processor based on the same GPU that is at the heart of the FirePro V8800 graphics card, announced last month. The V8800 has 2.6 teraflops of single-precision math oomph. Now you are talking about needing only 18 of these cards in a 2U chassis. While this still sounds crazy, given the power draw and heat thrown off - we're talking about over 4,000 watts for 18 GPU co-processors - this seems closer to possibility than a Project Mojo machine using the Nvidia Teslas.
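Running the same sums for the V8800 gives the figures above. This Python sketch again assumes a 42U rack filled with 2U chassis; the per-card wattage is not a published spec but simply what the 4,000 watt figure implies when spread over 18 cards.

```python
import math

PETAFLOPS = 1e15                  # one petaflops, in flops
V8800_SINGLE_PRECISION = 2.6e12   # 2.6 teraflops, as quoted in the text
RACK_UNITS = 42                   # assumed standard 42U rack
CHASSIS_HEIGHT_U = 2              # 2U chassis, as in the text

# Cards needed for one single-precision petaflops.
cards = math.ceil(PETAFLOPS / V8800_SINGLE_PRECISION)

# Number of 2U chassis in the rack, and cards each one must hold.
chassis_per_rack = RACK_UNITS // CHASSIS_HEIGHT_U
cards_per_chassis = cards / chassis_per_rack

# Per-card draw implied by "over 4,000 watts for 18 GPU co-processors".
implied_watts_per_card = 4000 / 18

print(f"Cards for a petaflops: {cards}")
print(f"Cards per 2U chassis: {cards_per_chassis:.1f}")
print(f"Implied draw per card: at least {implied_watts_per_card:.0f} watts")
```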
It would not be surprising to see a tiny server board, maybe with only one Opteron 4100 socket but with four PCI-Express x16 slots, as the main board in such a system. You could assign two cores for each GPU in such a setup, and maybe put four of them on a server tray. And then run when you turn it on.
It will be very interesting to see how SGI does it. For one thing, the word "cabinet" may not be synonymous with a standard 42U server rack. The cabinet could be wider than a standard rack, like IBM's iDataPlex machines are (they are twice as wide, but half the depth, of a standard rack, allowing for server nodes to be shaped differently and made more dense).
Mannel is not giving away many clues about Project Mojo, but does say that SGI's ability to deliver that much oomph by the end of the year depends on the accelerator vendors sticking to their roadmaps. And that there will be more than one way to skin the cat.
"Within the horizon of a year, there will be multiple ways from multiple vendors to get to a petaflops in a cabinet," Mannel says. ®