Cray scales to over 100 petaflops with 'Cascade' XC30 behemoth
Aries interconnect, Dragonfly topology crush Gemini toruses
Hot on the heels of the delivery of the 20-plus petaflops "Titan" CPU-GPU hybrid supercomputer to Oak Ridge National Laboratory last week, Cray has launched what is unquestionably a much better machine, the long-awaited "Cascade" system developed in conjunction with the US Defense Advanced Research Projects Agency and sporting the new "Aries" interconnect.
The Aries interconnect is so important to hyperscale and parallel computing that Intel shelled out $140m back in April to get control of the people who created the Aries and predecessor "Gemini" interconnects, the chip designs themselves, and the 34 patents associated with them.
Cray retained exclusive rights to the use of Gemini and Aries, so you are not going to be able to buy an Aries chip at Newegg and build your own XC30 supercomputer. (Sorry.) Further down the road, Cray and Intel are working on a common supercomputer design called "Shasta," which may or may not use an interconnect similar to the fifth generation "Pisces" interconnect that Cray was kicking around as an idea two years ago.
What we do know is that Intel will be footing most of the bill for whatever the Shasta interconnect is, which suits Cray fine, apparently.
Now that Cascade is launched, Cray is willing to talk about a few things that were not disclosed about the project. Barry Bolding, who is currently vice president of storage and data management at Cray, worked at Cray Research two decades ago, then left to work for IBM for a while. Bolding came back to Cray when Peter Ungaro, who used to run the HPC biz for Big Blue, asked him to return to Cray and, specifically, to handle Cascade.
"This project was my baby for a while, and it was a very tough time on us," Bolding tells El Reg. But as the Gaffer says in Lord of the Rings, "All's well as ends better."
First, and this was a bit surprising, DARPA does not actually get its own Cascade machine for all of the money that it spent on Cascade development, but rather has access to a machine installed elsewhere for a number of months. Then if DARPA thinks the machine passes muster, the branches of the US military can decide on their own whether to buy a machine or not.
Cray gets to monetize all of DARPA's investments, as it did with prior systems funded by the government. It's good work, if you have the nerves of steel to keep your wits in the low-margin, high-stakes supercomputer racket.
In phase one of DARPA's High Productivity Computing Systems program in 2003, Cray originally received $43.1m to begin work on the Cascade line of machines, which sought to converge various machines based on x86, vector, FPGA, and MTA multithreaded processors into a single platform. (GPU accelerators were not yet on the scene.)
In phase two of the HPCS effort, Cray received a $250m award in 2006 to work further on Cascade and also to create its Chapel parallel programming language, which is available now and open source.
IBM got $244m to work on its PERCS system, which was similar to but not the same as the ill-fated "Blue Waters" Power7-based 20 petaflopper that Big Blue pulled the plug on at the University of Illinois last year, leaving Cray wide open to win a $188m deal with an XK7 Opteron-Tesla hybrid machine.
Anyway, in January 2010, DARPA scaled back the Cray Cascade funding by $60m, and neither DARPA nor Cray ever explained why.
The XC30 supercomputer, formerly known as Cascade
Now, we know. According to Bolding, Cray was going to take the work it had done on its multistreaming vector processors and massively multithreaded ThreadStorm processors and create a new processor of its own to go along with the Aries interconnect.
"We eventually made the tough decision to focus on the interconnect," says Bolding. And that has clearly paid off, and as it turns out, has made it easier for Cray to embrace GPU and x86 coprocessors like Nvidia's Tesla and Intel's Xeon Phi as adjunct and more efficient compute engines alongside x86 processors inside its systems.
Die shot of the Aries interconnect chip
You can't blame Cray or DARPA or both for backing out of Cray's idea to create its own processor, particularly after Cray hitched its entire wagon to the Opteron processors from Advanced Micro Devices with the "SeaStar" XT and Gemini XE interconnects used in the similarly named XT and XE series of parallel supers.
Look at all the woe that comes from designing and fabbing processors, and how the Opteron delays time and again whipsawed Cray's revenues and profits. Worrying about interconnect chips was hard enough. And now, after Intel's shrewd move, that is Chipzilla's problem.
However, with Cray not doing interconnects or processors, that does call into question what its value-add will be during the Shasta generation of machines due around 2016.
Those XT and XE interconnects plugged directly into the HyperTransport ports of the Opteron processors. With the Aries XC interconnect (XC is apparently short for Extreme Computing), Cray is plugging directly into the on-chip PCI-Express 3.0 controllers on the Xeon E5-2600 processors. Those controllers have plenty of bandwidth, and using them gives Cray the option of letting Aries speak directly to any other device – be it a CPU, a GPU, or some other kind of computing or storage element – that has a PCI-Express 3.0 port.
The way I have heard it described in the past, Gemini was not initially in the roadmap, but the US government wanted something between SeaStar and Aries as an interim device, offering somewhere between the scalability of the two while retaining the 3D torus interconnect that prior XT machines had used.
So Gemini is a less-capable high-radix router that implements the 3D torus interconnect familiar from the XT machines, rather than the new "Dragonfly" topology that Aries implements in the XC machines.
Schematic of the Aries interconnect and Cascade nodes
Let's go over the Aries interconnect itself first and how it compares to Gemini, then discuss the Dragonfly topology.
The Gemini interconnect, as El Reg detailed in May 2010 when it launched, had a 48-port high-radix router called Yarc-2 (which was short for Yet Another Router Chip and also Cray spelled backwards). This router was used to virtualize links to the processors through four HyperTransport links to two pairs of Opteron processors.
The Gemini router had 168GB/sec of aggregate bandwidth. On one side of the chip, it created two virtual network interfaces for each two-socket Opteron node hanging off the HyperTransport; on the other side, it routed traffic out through six ports to other nodes in an XE6 all-Opteron system and, now, an XK7 Opteron-Tesla hybrid system.
As you can see in the schematic above, the Aries chip is a 48-port router as well, but it is implemented in a different way. Aries has four PCI-Express 3.0 pipes that link each two-socket Xeon E5 node into the chip.
The router's ports are bundled up to provide three different kinds of connectivity: links to an XC30 system chassis backplane for local hops, standard copper cables that lash six enclosures (two racks) of machines to each other, and optical cables that link multiple racks together in a single XC30 system. Cray calls the backplane network Rank 1, the rack link network Rank 2, and the cross-rack optical links Rank 3.
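To make the three ranks concrete, here is a minimal Python sketch that works out which tier connects two nodes, based only on where they sit. It is an illustration, not Cray's routing logic: the names `NodeLocation` and `link_rank` are hypothetical, the only figure taken from the description above is the six-chassis, two-rack group, and nodes that share a chassis are simply lumped into Rank 1 even if they hang off the same Aries chip.

```python
from dataclasses import dataclass

# From the article: copper cables lash six enclosures (two racks) together.
CHASSIS_PER_GROUP = 6


@dataclass(frozen=True)
class NodeLocation:
    """Hypothetical coordinates for a node; not a Cray data structure."""
    group: int    # index of the two-rack, six-chassis group
    chassis: int  # chassis index within the group, 0..CHASSIS_PER_GROUP-1

    def __post_init__(self):
        if not 0 <= self.chassis < CHASSIS_PER_GROUP:
            raise ValueError("chassis index must be in 0..5")


def link_rank(a: NodeLocation, b: NodeLocation) -> int:
    """Return the tier (1, 2, or 3) implied by where two nodes sit.

    Rank 1: same chassis, reached over the backplane.
    Rank 2: same group but different chassis, reached over copper cables.
    Rank 3: different groups, reached over optical cables.
    """
    if a.group != b.group:
        return 3
    if a.chassis != b.chassis:
        return 2
    return 1


if __name__ == "__main__":
    here = NodeLocation(group=0, chassis=0)
    print(link_rank(here, NodeLocation(group=0, chassis=0)))  # backplane
    print(link_rank(here, NodeLocation(group=0, chassis=4)))  # copper
    print(link_rank(here, NodeLocation(group=2, chassis=0)))  # optical
```

The sketch only labels the tier that separates two locations. It says nothing about how many hops a packet actually takes or how traffic is routed across the Dragonfly topology.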