Intel teaches Xeon Phi x86 coprocessor snappy new tricks
The interconnect rings a bell
Hot Chips It took fifteen years for Intel to shrink the computing power of the teraflops-busting ASCI Red massively parallel Pentium II supercomputer down to something that fits inside of a PCI-Express coprocessor card – and the Xeon Phi coprocessor is only the first step in a long journey with coprocessor sidekicks riding posse with CPUs in pursuit of exascale computing.
The evolution is remarkable – ASCI Red ate up 104 cabinets to break through 1 teraflops of double-precision floating point performance, as rated by the Linpack parallel Fortran benchmark test.
Intel has been showing off the performance of the "Knights Corner" x86-based coprocessor for so long that it's easy to forget that it is not yet a product you can actually buy. Back in June, Knights Corner was branded as the "Xeon Phi", making it clear that Phi was a Xeon coprocessor even if it does not bear a lot of resemblance to the Xeon processors at the heart of the vast majority of the world's servers.
At last week's Hot Chips conference in Cupertino, California, George Chrysos, Intel's senior principal engineer for the Knights family of coprocessors, did not get into many of the feeds and speeds of the Xeon Phi. Instead, he stuck to the architecture of the x64 cores inside the coprocessor, along with their ring interconnects and other features that will make it a good coprocessor for certain kinds of simulations and similar number-crunching jobs.
What Intel has said is that the Xeon Phi will have at least 50 cores and at least 8GB of GDDR5 graphics memory to feed those cores. In June, when Intel released Linpack test results for a 79-node cluster to get it onto the Top 500 supercomputer ranking for June 2012, El Reg did the math and worked out that Intel appears to have built the Discovery cluster with two cards in each node, with each coprocessor having 54 cores activated.
The Xeon Phi design almost certainly has 64 cores on the die – Intel has admitted as much with prior benchmarks and statements – but as is the case with any multicore GPU, it's devilishly difficult to get yields on a new process such that all cores in a complex chip are working properly.
We estimate the Xeon Phi's clock speed at somewhere between 1.2GHz and 1.6GHz, depending on how many cores are active on the die. The more cores, the lower the clock speed has to be to reach the magical teraflops performance level.
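The cores-versus-clock trade-off is simple arithmetic, and a minimal sketch makes it concrete. It uses the eight double-precision operations per clock per core that the article gives for the vector unit; the factor of two for fused multiply-add is an assumption that the article does not state, and the function name is made up for illustration. The result is a peak figure only: sustained Linpack always lands below peak, so real clocks would have to be somewhat higher.

```python
def required_clock_ghz(target_flops, cores, flops_per_cycle_per_core):
    """Clock in GHz each core needs for the chip to hit target_flops at peak."""
    return target_flops / (cores * flops_per_cycle_per_core) / 1e9


TERAFLOPS = 1e12

# ASSUMPTION: 8 double-precision lanes per core (from the article), each
# counted as 2 flops per cycle via fused multiply-add (not in the article).
DP_FLOPS_PER_CYCLE = 8 * 2

for cores in (50, 54, 64):
    ghz = required_clock_ghz(TERAFLOPS, cores, DP_FLOPS_PER_CYCLE)
    print(f"{cores} cores need about {ghz:.2f} GHz for 1 teraflops peak")
```

Drop the multiply-add assumption (set the per-cycle figure to 8) and the required clocks double, which is why that assumption matters to any such estimate.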
The PCI card housing a Xeon Phi coprocessor
With 54 active cores baked in the 22-nanometer Tri-Gate process used to etch the Xeon Phi chips, Intel would be doing as well as Taiwan Semiconductor Manufacturing Corp did with its 40-nanometer processes making Nvidia Fermi GPUs. Those came out with 512 cores on the die, but only 448 cores were active in the first generation of products. As the 40nm process ramped, Nvidia could get to the full 512 cores. And, as TSMC is ramping its 28-nanometer processes to make the GK104 and GK110 GPUs for the Tesla K10 and K20 server coprocessors, you can bet it won't be able to get all the cores to work on the first go, either.
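To put numbers on that comparison, here is a quick sketch of the share of cores switched on in each case. The Fermi figures come straight from the text; the 64-core total for Xeon Phi is El Reg's estimate rather than a confirmed figure, and the helper name is hypothetical.

```python
def active_fraction(active_cores, total_cores):
    """Share of the cores on the die that are enabled in the shipping part."""
    return active_cores / total_cores


# First-generation Fermi: 448 of 512 cores active (figures from the article).
fermi = active_fraction(448, 512)

# Xeon Phi: 54 active cores out of an ESTIMATED 64 on the die.
phi = active_fraction(54, 64)

print(f"Fermi: {fermi:.1%} of cores active")
print(f"Xeon Phi (estimated): {phi:.1%} of cores active")
```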
Not Just a Pentium in GPU drag
The Knights family of coprocessor chips might be salvaged from the dead remains of the "Larrabee" x86-based GPU project, but it packs features that Intel has cooked up specifically to make it a coprocessor for x86-based workstations and departmental servers running technical workloads, as well as in parallel supercomputer clusters that need better bang for the buck – and for the watt – than a Xeon or Opteron processor can bring all by itself.
It all starts in the Xeon Phi core, which is a heavily customized Pentium P54C – which is kinda funny when you consider how close that is to the Pentium II cores used in the ASCI Red super. But this core has been modified so much Red would barely recognize it.
Block diagram of the Xeon Phi core
Chrysos pointed out a couple of interesting things about the Xeon Phi core. First, the core does in-order execution and can handle four threads at the same time. The chip has a two-wide issue pipeline, and the four threads will look like HyperThreading to Linux software, but Chrysos says that the threads are really there to mask misses in the pipeline. Each core has 32KB of L1 instruction cache and 32KB of data cache, plus a 512KB L2 cache.
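Why extra threads help an in-order core can be shown with a toy model. The sketch below is not Intel's scheduler: it assumes a single-issue core that picks one ready thread per cycle in round-robin order, and the work and miss lengths are invented numbers chosen only to make the effect visible.

```python
def core_utilization(threads, work_cycles, miss_cycles, total_cycles=10_000):
    """Fraction of cycles in which a toy in-order core issues an instruction.

    Each thread issues for work_cycles, then stalls for miss_cycles on a
    miss. The core picks one ready thread per cycle, round-robin.
    """
    remaining = [work_cycles] * threads  # issue slots left before the next miss
    ready_at = [0] * threads             # first cycle each thread may issue again
    busy = 0
    next_thread = 0
    for cycle in range(total_cycles):
        for offset in range(threads):
            t = (next_thread + offset) % threads
            if ready_at[t] <= cycle:
                busy += 1
                remaining[t] -= 1
                if remaining[t] == 0:    # this thread now waits on a miss
                    ready_at[t] = cycle + 1 + miss_cycles
                    remaining[t] = work_cycles
                next_thread = (t + 1) % threads
                break
    return busy / total_cycles


# Hypothetical workload: 10 cycles of work between 30-cycle misses.
for n in (1, 2, 4):
    print(f"{n} thread(s): {core_utilization(n, 10, 30):.0%} of cycles busy")
```

With one thread the core sits idle for the whole of every miss; each added thread fills some of those idle cycles, which is the masking effect Chrysos describes.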
Interestingly – and taking a swipe at those ARM fanbois who want to believe that the x86 architecture can't scale down – Chrysos says that on the Xeon Phi chip, the x86-specific logic takes up less than 2 per cent of the core and L2 real estate.
The vector processing unit on the Xeon Phi core is a completely new design, Chrysos tells El Reg, and processes 512-bit SIMD instructions instead of the 128-bit or 256-bit AVX instructions you find in modern Xeons.
That VPU is capable of processing eight 64-bit double-precision floating point operations, or sixteen 32-bit single-precision operations in a clock cycle. Chrysos says this 2:1 ratio of SP to DP on floating point operations is "good for circuit optimization," and adds that when the chip is doing DP math, the SP units share their multiplier units to help speed up the calculations.
The VPU has a feature new to Intel called the Extended Math Unit, or EMU, which supports vectorized transcendental operations such as reciprocal, square root, and exponent instead of relying on polynomials and coefficients to estimate the results of such operations, in yet another case of specialized circuits doing the math in hardware instead of in software or using dirty-trick short cuts.
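For contrast, this is roughly what the software route of "polynomials and coefficients" looks like. The sketch evaluates a truncated Taylor series for the exponential with Horner's rule; it illustrates the general technique only and is not how the EMU or any Intel math library works. The degree and the function name are arbitrary choices.

```python
import math

# Taylor coefficients 1/k! for exp(x), highest degree first (Horner order).
COEFFS = [1 / math.factorial(k) for k in range(8, -1, -1)]


def exp_poly(x):
    """Estimate exp(x) with a degree-8 polynomial; only accurate near zero."""
    acc = 0.0
    for c in COEFFS:
        acc = acc * x + c   # one multiply and one add per coefficient
    return acc


for x in (0.1, 0.5, 3.0):
    err = abs(exp_poly(x) - math.exp(x))
    print(f"x={x}: absolute error of the polynomial estimate = {err:.2e}")
```

The estimate costs a chain of multiplies and adds per element and loses accuracy away from zero, which is the kind of overhead a dedicated hardware unit is meant to avoid.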
Also new to Intel chips and burned inside of the VPU is a scatter-gather unit, which does cache-line management that is particularly useful for programs written sequentially but which need to be parallelized and vectorized.
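In case the terms are unfamiliar: a gather pulls elements from scattered positions into one contiguous vector, and a scatter writes a vector back out to scattered positions. The plain-Python sketch below shows only those semantics; it says nothing about how the hardware unit manages cache lines, and the function names are illustrative.

```python
def gather(memory, indices):
    """Collect memory[i] for each index into one contiguous vector."""
    return [memory[i] for i in indices]


def scatter(memory, indices, values):
    """Write each value back to its own, possibly non-adjacent, position."""
    for i, v in zip(indices, values):
        memory[i] = v


memory = list(range(0, 100, 10))           # 0, 10, 20, ... 90
idx = [7, 2, 5]                            # arbitrary, non-contiguous positions

vec = gather(memory, idx)                  # one "vector" of three elements
scatter(memory, idx, [v + 1 for v in vec]) # operate on it, then put it back
print(vec, memory)
```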
The ring architecture of the Xeon Phi chip rings a bell
Multiple Xeon Phi cores are connected to each other with high-speed ring interconnects on the chips, just like Intel has done with the Xeon E5 processors and will do soon for the Itanium 9500 processors for servers.
This ring is a bit different, though. In fact, it has three rings. There is a block ring for shipping data to the cores through their L2 caches, and two address/acknowledgement rings (AD/AK) where requests are sent and coherency messages are sent to all of the cores on the die to keep their tag directories in sync. There is no SMP-like cache coherency across all of those cores and caches, as far as El Reg knows.
The GDDR5 main memory is interleaved across the cores and accessed through these rings as well; the rings hook to two GDDR5 memory controllers on the die. By linking GDDR5 memory ports onto the ring, the interleaving around the cores and rings smooths out the operation of the coprocessor when all the cores are working hard.
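Interleaving of this sort generally means that consecutive chunks of the address space are dealt out to the memory ports in turn, so a streaming workload never hammers just one of them. The sketch below uses the simplest possible mapping. The 64-byte granularity and the modulo scheme are assumptions for illustration; Intel did not describe the actual mapping.

```python
from collections import Counter


def controller_for(address, num_controllers, line_bytes=64):
    """Pick a memory controller by dealing out lines in turn.

    ASSUMPTION: 64-byte granularity and a plain modulo mapping, used here
    only to illustrate the idea of interleaving.
    """
    return (address // line_bytes) % num_controllers


# Walk 1,024 consecutive lines and count how many land on each controller.
hits = Counter(controller_for(line * 64, 2) for line in range(1024))
print(dict(hits))
```

A sequential sweep through memory ends up split evenly between the controllers, which is the smoothing effect described above.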
Under normal circumstances, Intel would have only put in one data ring and one AD/AK ring. But after some performance testing, Intel doubled up the AD/AK rings so that data passing around the ring does not get in the way of memory requests or acknowledgements and hurt performance.
Streaming stores and multiple rings boost Xeon Phi performance
In the chart above, you can see what happened to a Xeon Phi chip once it got to 32 cores running the Stream Triad memory-bandwidth benchmark test with only one AD/AK ring. Simply put, it crapped out. Adding the second AD/AK ring boosted the performance at 50 cores by around 40 per cent on Stream Triad.
Another feature, called Streaming Stores (and new with the Xeon Phi chips), significantly cuts down on the amount of bandwidth needed to do full cache-line stores, and added another 30 per cent to memory bandwidth performance on real work.
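A back-of-the-envelope model suggests where a gain of that size could come from on a kernel like Stream Triad, which computes a[i] = b[i] + s * c[i]. The sketch assumes that, without streaming stores, the cache reads the destination line before overwriting it (the usual write-allocate behaviour); that detail is an assumption and not something Chrysos spelled out.

```python
def triad_bytes_per_element(streaming_stores, element_bytes=8):
    """Memory traffic per element of a[i] = b[i] + s * c[i] (simplified)."""
    reads = 2 * element_bytes   # load b[i] and c[i]
    write = element_bytes       # store a[i]
    # ASSUMPTION: without streaming stores the destination line is read
    # into the cache first, even though it is about to be overwritten.
    read_before_write = 0 if streaming_stores else element_bytes
    return reads + write + read_before_write


plain = triad_bytes_per_element(streaming_stores=False)
streamed = triad_bytes_per_element(streaming_stores=True)
print(f"{plain} bytes vs {streamed} bytes per element, "
      f"a {plain / streamed - 1:.0%} gain in useful work per byte moved")
```

Under these assumptions the saving is roughly a third, in the same ballpark as the figure Intel quoted.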
The Xeon Phi chip has the capability of supporting the TCP/IP protocol over the PCI-Express bus, which means you can Telnet into or use sockets to link into a Xeon Phi coprocessor card just like you would any other independent Linux-based server. The TCP/IP-over-PCI support also means that Xeon Phi cards within a system can talk directly to each other.
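In practice that means ordinary socket code should be all a host program needs. Here is a minimal sketch using Python's standard socket module; the helper name is made up, and the host name and port in the usage note are hypothetical placeholders, since the article does not say how a card is addressed.

```python
import socket


def send_and_receive(host, port, payload, timeout=5.0):
    """Open a TCP connection, send payload, return up to 4096 bytes of reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
        return conn.recv(4096)
```

A call such as send_and_receive("phi-card.example", 5000, b"status") would then talk to a service listening on the card exactly as it would to one on any other Linux box on the network.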
You can also use the Remote Direct Memory Access (RDMA) protocol from InfiniBand to let external server nodes hosting any specific Xeon Phi coprocessor card talk to other Xeon Phi cards on the InfiniBand network. If Ethernet floats your boat, the RoCE variant of direct memory addressing (short for RDMA over Converged Ethernet) also works for linking Phis to other Phis.
In terms of power management, the Xeon Phi chip has clock and power gating on each core, and when the cores have been in a low power state for long enough, the L2 caches and interconnect rings are clock-gated. If you sit still long enough, the L2 caches and interconnects are shut down after dumping their contents to the GDDR5 memory.
Phee, Phi, Pho, Phum
What Chrysos could not talk about at Hot Chips were the feeds, speeds, SKUs, and pricing of the Xeon Phi chips – that is what Chipzilla hires marketing people to do as a product gets closer to launch.
We don't know when that will be, but Intel has said that the Xeon Phi coprocessor will be in production by the end of the year, presumably to ship to OEM customers either this year or early next. All we know for sure is that the "Stampede" supercomputer going into the University of Texas in January is the only publicly announced machine that admits to its schedule. ®