Silicon Graphics is betting big on Intel's latest Xeon E5-4600 processor and its own revved up NUMAlink 6 shared memory interconnect, creating a "big brain computer" that can gang up to 4,096 cores into a single system image to run massive Linux workloads and fairly large Windows jobs, too. The new UV 2 is exactly the kind of box, says SGI, that customers with big data warehouse, big database, big data, and traditional HPC workloads have always wanted – and in many cases could never have afforded.
But the shift from Itanium, and then from Xeon E7 chips, to new packaging and lower-cost Xeon E5 processors from Intel has made SGI's shared memory systems more broadly accessible at just the time that many workloads seem to be busting out of general-purpose four-socket boxes. This is good news for SGI, which has had its share of financial woes as it chases the capricious and fiercely competitive HPC and hyperscale data center markets.
SGI will also be pleased to note that Intel has not yet got interconnect fabrics woven into its Xeon processors and chipsets, although it is clearly working on that with the acquisition of Cray's family of HPC interconnects back in April, its purchase of the InfiniBand chip and switch business from QLogic in January, and its purchase of Ethernet switch chip maker Fulcrum Microsystems back in July 2011.
However, SGI still has a good window in which to capitalize on its NUMAlink interconnect before Intel does whatever it's going to do to integrate interconnects with its CPUs and chipsets. It would not be surprising to see SGI sell the NUMAlink biz to Intel for a big chunk of change, or maybe even an acquisitive Advanced Micro Devices or Hewlett-Packard. In fact, it would not be surprising at all if HP just upped and bought SGI to get out of its Itanium conundrum with Oracle. But so far, SGI seems content to go it alone and to peddle rack and shared memory systems all by its lonesome.
A rack's worth of SGI's UV 2000 supercomputer
SGI put out a bit of a preview of the UV 2 lineup when Intel launched the Xeon E5-4600 processors a little more than a month ago. At the time, the company said that it was switching away from the Xeon 7500 and E7 and their multiple QuickPath Interconnect (QPI) ports. SGI had also said it was moving away from the "Boxboro" 7500 chipset that it had used to interface with the NUMAlink 5 interconnect for lashing nodes tightly together in a memory-coherent fashion. The UV 1000 high-end machines were based on a two-socket blade.
The Xeon 7500 and E7 chips have four QPI ports coming off each socket, and the original UV 1000 design used two QPI ports on the Xeon 7500 or E7 chips to cross-link the two sockets together, with one of the remaining two QPI ports going to the Boxboro chipset (which controls access to main memory and local I/O slots on the blade) and the other linking out to the NUMAlink 5 hub, which in turn has four links out to the NUMAlink 5 router. That router implements an 8x8 (paired node) 2D torus that can deliver up to 16TB of shared memory across as many as 256 sockets.
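To make the UV 1000 topology a little more concrete, here is a rough model of an 8x8 2D torus with wraparound links. It is a sketch under stated assumptions, not SGI's routing logic: we read "paired node" as two blades hanging off each torus position, which squares with the 128 blades and 256 sockets quoted for the UV 1000, and we treat 1TB as 1,024GB. The names `neighbors` and `hops` are made up for illustration.

```python
from itertools import product

# Hypothetical model of the NUMAlink 5 router layer as an 8 x 8 2D torus.
# Assumption: each torus position serves a pair of two-socket blades.
ROWS, COLS = 8, 8
BLADES_PER_POSITION = 2   # our reading of "paired node"
SOCKETS_PER_BLADE = 2
SHARED_MEMORY_TB = 16
GB_PER_TB = 1024          # binary units

def neighbors(r, c):
    """Four torus neighbours of position (r, c); edges wrap around."""
    return {((r - 1) % ROWS, c), ((r + 1) % ROWS, c),
            (r, (c - 1) % COLS), (r, (c + 1) % COLS)}

def hops(a, b):
    """Minimum hop count between two positions, using the wraparound links."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    return min(dr, ROWS - dr) + min(dc, COLS - dc)

positions = list(product(range(ROWS), range(COLS)))
blades = len(positions) * BLADES_PER_POSITION
sockets = blades * SOCKETS_PER_BLADE
# Worst-case distance (network diameter) across every pair of positions.
diameter = max(hops(a, b) for a in positions for b in positions)
gb_per_socket = SHARED_MEMORY_TB * GB_PER_TB // sockets

print(f"{blades} blades, {sockets} sockets, diameter {diameter} hops, "
      f"{gb_per_socket}GB per socket")
```

The wraparound links are what keep the worst case down: on a plain 8x8 mesh the corner-to-corner trip is 14 hops, while on the torus no two positions are more than eight apart.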
While SGI let it be known a month ago that it was ditching the Xeon E7s for the E5-4600s in the next-generation UV 2000 shared memory supers, the company did not say exactly how it was going to build these machines. (SGI had to save a little something to talk about at the International Supercomputing Conference in Hamburg, Germany this week, after all.) El Reg speculated that there would be a goosed interconnect and that SGI would stick to two-socket blades. We were right on the first count but wrong on the second: because the Xeon E5-4600 has two fewer QPI ports than the Xeon 7500 and E7, the bandwidth between the sockets on a two-socket blade would have been significantly diminished. It was easier and cleaner to make what is in effect a microserver and use both QPI ports to double up the links out to the new NUMAlink 6 interconnect hub, and that is what SGI has done.
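For readers keeping score on QPI ports, here is a rough tally of how each design spends them. It is a sketch: the UV 1000 and UV 2000 allocations follow the description above, while the two-socket E5-4600 blade is the design SGI passed on, and its split (one port to the partner socket, one to the hub) is our assumption. The `DESIGNS` table and `check` helper are hypothetical, and counting links says nothing about the bandwidth of any individual link.

```python
# QPI ports per socket and how a design allocates them (links, not GB/sec).
DESIGNS = {
    "UV 1000, two-socket Xeon 7500/E7 blade":
        (4, {"cross_link": 2, "chipset": 1, "hub": 1}),
    # Hypothetical: the rejected two-socket E5-4600 blade.
    "two-socket E5-4600 blade (not built)":
        (2, {"cross_link": 1, "hub": 1}),
    "UV 2000, single-socket E5-4600 server":
        (2, {"hub": 2}),
}

def check(ports, allocation):
    """Verify the allocation uses exactly the ports a socket has, then
    return (socket-to-socket links, links out to the NUMAlink hub)."""
    used = sum(allocation.values())
    if used != ports:
        raise ValueError(f"allocation uses {used} of {ports} QPI ports")
    return allocation.get("cross_link", 0), allocation.get("hub", 0)

for name, (ports, allocation) in DESIGNS.items():
    cross, hub = check(ports, allocation)
    print(f"{name}: {cross} socket-to-socket, {hub} to NUMAlink hub")
```

Under these assumptions the two-socket E5 blade halves the direct socket-to-socket links relative to the UV 1000, while the single-socket layout sends every QPI port a socket has toward the NUMAlink 6 hub.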
SGI would no doubt have preferred to build the original UV 1000 machines, which debuted in November 2009 and spanned 128 blades and 256 sockets in a shared memory configuration, using cheaper Xeon 5500 and 5600 processors. But these chips have only one QPI port coming off their sockets and their on-chip memory controllers cannot address as much memory as the Xeon 7500s and E7s, so SGI had no choice but to use the fat Xeons in 2009 and await the less expensive E5-4600s here in 2012.
The memory expansion on the E5-4600 chip is the key to the rejiggered UV 2000 machine, since each processor socket can currently drive a dozen memory slots and address up to 384GB of memory without any external memory buffers or funky chipsets. But the real secret sauce in the UV 2000 is the NUMAlink 6 interconnect, a substantial re-engineering of NUMAlink 5 that offers about 2.5 times the bandwidth and a much simpler system design as well.
Jill Matzke, director of server marketing at SGI, says that with the NUMAlink 6, a bunch of different things happened all at once. First, SGI's ASIC partner, Avago Technologies, did a process shrink, allowing more stuff to be crammed onto the chip. (Avago, which is a spinout of Agilent Technologies, itself a spinout from Hewlett-Packard, doesn’t actually make the NUMAlink chips; a fab in Taiwan does.) So SGI could take two of the NUMAlink hubs and put them onto a single chip. SGI could also bring the NUMAlink router onto the ASIC for the first time. Equally important, some of the functions that had been performed by the NUMAlink hub and router in the Xeon 7500 and E7 machines are now done by the Xeon E5s themselves; PCI-Express controllers are one new on-chip function. The result is a much simpler set of NUMAlink ASICs. (And you can see now why Intel wants to control the interconnects.)
With the UV 1000 design, there was a node controller in the blade chassis – which the nodes in the chassis shared – and a NUMAlink router at the top of the rack. With the UV 2000, more of the router functionality is contained in the NUMAlink hub/node controller on the system board, and the node controllers are doubled up for bandwidth. You can scale across two racks of UV 2000 machines without using an external top-of-rack router.
But, says Matzke, if you want to add extra bandwidth across those E5-4600 unisocket blades, you can add NUMAlink 6 routers at the top of the racks, too. This allows customers to dial up the CPU and bandwidth scalability independently of each other with the UV 2000, something you could not do with the UV 1000. The NUMAlink 6 interconnect provides 6.7Gb/sec of bi-sectional bandwidth.
The basic node on the UV 2000 has two single-socket servers with a vertical extender card sandwiched between the two stacked motherboards and linking them together with a NUMAlink 6 hub chip. This packaging is similar, in concept, to the "Gemini" blade used in the ICE X Xeon E5-2600 clusters that were previewed last November at SC11 and that started shipping in March of this year. A 10U chassis holds eight half-width nodes, with up to 128 cores and 4TB of memory. A single rack has four of these, for up to 512 cores and 16TB of memory; and a fully loaded UV 2000 has eight racks for a total of 4,096 cores and 64TB of global shared memory. If Intel had switched on one more address bit in the E5-4600 memory controller, SGI could have pushed the memory up to the full 128TB that it is physically possible to put in the 512 sockets of a fully loaded UV 2000 machine. But it didn't, so you can't.
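The scaling arithmetic is easy to check. The sketch below multiplies out the chassis and rack figures and then applies the physical addressing ceiling. The per-socket inputs (eight cores and 256GB) are derived by dividing the chassis figures by its 16 sockets, which puts the configured memory below the 384GB per-socket ceiling mentioned earlier; the 46-bit physical address width is our assumption, chosen because 2^46 bytes works out to exactly the 64TB cap. `summarize` is a hypothetical helper.

```python
# Inputs derived from the chassis figures (128 cores, 4TB, 16 sockets).
CORES_PER_SOCKET = 8
GB_PER_SOCKET = 256
SOCKETS_PER_NODE = 2     # two single-socket servers per node
NODES_PER_CHASSIS = 8    # 10U chassis, eight half-width nodes
CHASSIS_PER_RACK = 4
RACKS = 8
ADDRESS_BITS = 46        # assumption: E5-4600 physical address width

GB_PER_TB = 1024         # binary units throughout
TIB = 2 ** 40            # bytes per tebibyte

def summarize(sockets):
    """Return (cores, TB of installable memory) for a socket count."""
    return sockets * CORES_PER_SOCKET, sockets * GB_PER_SOCKET // GB_PER_TB

chassis_sockets = SOCKETS_PER_NODE * NODES_PER_CHASSIS
rack_sockets = chassis_sockets * CHASSIS_PER_RACK
system_sockets = rack_sockets * RACKS

chassis_cores, chassis_tb = summarize(chassis_sockets)
rack_cores, rack_tb = summarize(rack_sockets)
system_cores, physical_tb = summarize(system_sockets)

# The processors can address only 2**ADDRESS_BITS bytes, so the address
# width, not the DIMM count, caps the single system image.
addressable_tb = 2 ** ADDRESS_BITS // TIB
one_more_bit_tb = 2 ** (ADDRESS_BITS + 1) // TIB
usable_tb = min(physical_tb, addressable_tb)

print(f"chassis: {chassis_cores} cores, {chassis_tb}TB")
print(f"rack:    {rack_cores} cores, {rack_tb}TB")
print(f"system:  {system_cores} cores, {physical_tb}TB installable, "
      f"{usable_tb}TB addressable ({one_more_bit_tb}TB with one more bit)")
```

Each extra address bit doubles the reachable memory, which is why a single bit is the difference between the 64TB cap and the 128TB that eight racks could physically hold.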