Putting it all together: What does a complete package look like?
If single-thread performance is the most important thing for a piece of work, a core or set of cores will step down the threading automagically and run it with fewer processor threads. The Power8 core, said Stuecheli, has twice as much L1 data cache at 64KB compared to its predecessor (L1 instruction cache remains the same). Data buses from L1 to L2 cache on the die are now twice as wide at 64 bytes. The core has larger issue queues, improved branch prediction, can handle twice as many data cache misses, and has significantly beefed up prefetching of instructions and data. Add it all up, and at a 4GHz clock speed, a Power8 chip will yield about 1.6 times the single-threaded performance of a Power7 chip from 2010.
Block diagram of the Power8 chip
Each core has 512KB of SRAM L2 cache etched right near it. A segmented NUMA-like L3 cache, using what IBM calls a "non-uniform cache architecture" or NUCA for short, spans all twelve cores on the die, for a total of 96MB of L3 cache. That's only 8MB of L3 cache per core, compared to 10MB per core for the Power7+ chip announced last year, but the Power8 has a much more sophisticated main memory subsystem and an L4 cache that obviates the need for so much L3 cache on the die. (More on that in a second.) The L3 cache is implemented using embedded DRAM, as was the case with the Power7 and Power7+ processors.
At a 4GHz clock speed, you can move data into L3 cache from the external L4 cache at 128GB/sec and from the L3 cache out to L4 at 64GB/sec. Data can be crammed into L2 cache from L3 at 128GB/sec (or back out at the same bandwidth). The pipe from L2 cache into the cores has 256GB/sec of bandwidth, but only 64GB/sec in the other direction. Add it all up, across a twelve-core Power8 chip that works out to 4TB/sec of L2 cache bandwidth and 3TB/sec of L3 cache bandwidth.
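For readers who want to check the sums, here is a rough sketch of how the chip-level totals fall out of the per-core figures quoted above. It assumes the totals simply add both directions of each link and multiply by twelve cores, which is our reading of the numbers rather than anything IBM has spelled out; the function and variable names are made up for illustration.

```python
# Per-core bandwidth figures quoted in the article, in GB/sec at 4GHz.
CORES = 12
L2_INTO_CORE = 256   # L2 cache into the core
CORE_INTO_L2 = 64    # core back out to L2
L3_INTO_L2 = 128     # L3 into L2
L2_INTO_L3 = 128     # L2 back out to L3


def chip_bandwidth_gb(inbound, outbound, cores=CORES):
    """Add both directions of a per-core link and scale by core count.

    Assumption: this is how the chip-level totals were derived.
    """
    return (inbound + outbound) * cores


l2_total = chip_bandwidth_gb(L2_INTO_CORE, CORE_INTO_L2)
l3_total = chip_bandwidth_gb(L3_INTO_L2, L2_INTO_L3)

print(f"L2 aggregate: {l2_total} GB/sec (~{l2_total / 1024:.2f} TB/sec)")
print(f"L3 aggregate: {l3_total} GB/sec (~{l3_total / 1024:.2f} TB/sec)")
```

On that reading the L2 figure comes to 3,840GB/sec, which rounds up to the quoted 4TB/sec, and the L3 figure lands on 3,072GB/sec, or 3TB/sec.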
The Centaur memory buffer/controller chip for Power8 CPUs
Chip makers have been putting memory controllers onto processors for quite some time now, but IBM has done something clever with the Power8. Instead of picking either an existing DDR3 or a future DDR4 controller for the die, Big Blue has created a generic memory controller for the die that speaks out over a high-speed bus to a memory buffer (and now quasi-controller) chip called Centaur. This chip is so named, says Stuecheli, because it is half L4 cache and half memory controller.
In this case, the Centaur chip is implementing DDR3 main memory, but should IBM want to shift out to DDR4 at some future time, it can swap out the memory cards and their integrated L4 cache and buffer chips that were designed for DDR3 memory for ones that use DDR4 chips without changing anything on the processors.
All of the memory scheduling logic, caching structures, and energy management features of what was an on-die memory controller with prior Power chips are now in the Centaur chip. That memory link between the Power8 package and the Centaur memory buffer chip has a 40-nanosecond latency and 9.6GB/sec of bandwidth. That Centaur chip is also implemented in IBM's 22-nanometer processes and includes 16MB of cache memory which is used as L4 cache by the processor.
Each Power8 chip can have up to eight of these Centaur chips, for a total of 128MB of L4 cache in a fully loaded socket. That socket would have eight memory channels, for a total of 230GB/sec of sustained bandwidth into and out of the processor, and the 32 DDR memory ports hanging off one twelve-core chip would have 410GB/sec of peak bandwidth at the DRAM level.
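The socket-level memory numbers can be broken back down per Centaur chip and per DDR port with a few lines of arithmetic. This is only a sketch: it assumes the bandwidth and the DDR ports are spread evenly across the eight Centaur chips, which the figures suggest but IBM has not confirmed here, and the variable names are ours.

```python
# Socket-level figures quoted in the article.
CENTAURS_PER_SOCKET = 8
L4_PER_CENTAUR_MB = 16
SUSTAINED_GB_PER_SEC = 230   # sustained, into and out of the processor
DDR_PORTS_PER_SOCKET = 32
PEAK_DRAM_GB_PER_SEC = 410   # peak, at the DRAM level

# Total L4 cache in a fully loaded socket.
l4_total_mb = CENTAURS_PER_SOCKET * L4_PER_CENTAUR_MB

# Assumption: ports and bandwidth divide evenly across the Centaur chips.
ports_per_centaur = DDR_PORTS_PER_SOCKET // CENTAURS_PER_SOCKET
sustained_per_channel = SUSTAINED_GB_PER_SEC / CENTAURS_PER_SOCKET
peak_per_port = PEAK_DRAM_GB_PER_SEC / DDR_PORTS_PER_SOCKET

print(f"L4 per socket:         {l4_total_mb} MB")
print(f"DDR ports per Centaur: {ports_per_centaur}")
print(f"Sustained per channel: {sustained_per_channel:.2f} GB/sec")
print(f"Peak per DDR port:     {peak_per_port:.2f} GB/sec")
```

The per-port peak works out to roughly 12.8GB/sec, which is what a single 64-bit DDR3-1600 channel delivers, although the memory speed itself was not given.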
With 32GB DDR3 memory sticks, each Power8 socket will be able to support 1TB of main memory, and presuming the high-end Power8 machine has 32 sockets like the Power7-based Power 795 server does, that means IBM can deliver a box with 32TB of memory across 384 cores and 3,072 processor threads.
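The big-box numbers follow from simple multiplication, sketched below. Two inputs are inferred rather than quoted: 32 memory sticks per socket (1TB divided by 32GB, which also matches the 32 DDR ports) and eight threads per core (3,072 threads divided by 384 cores). The 32-socket count is the same presumption made above, not a confirmed configuration.

```python
# Figures from the article, plus two inferred values (marked).
SOCKETS = 32             # presumed, matching the Power 795
CORES_PER_SOCKET = 12
THREADS_PER_CORE = 8     # inferred: 3,072 threads / 384 cores
STICK_GB = 32
STICKS_PER_SOCKET = 32   # inferred: 1TB per socket / 32GB per stick

memory_per_socket_tb = STICK_GB * STICKS_PER_SOCKET / 1024
total_memory_tb = memory_per_socket_tb * SOCKETS
total_cores = SOCKETS * CORES_PER_SOCKET
total_threads = total_cores * THREADS_PER_CORE

print(f"Memory per socket: {memory_per_socket_tb:.0f} TB")
print(f"System memory:     {total_memory_tb:.0f} TB")
print(f"Cores:             {total_cores}")
print(f"Threads:           {total_threads}")
```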
The Power8 chip will also have integrated PCI-Express 3.0 controllers, bringing IBM's Power chips on par with competing Sparc T5 and M5 chips from Oracle and Xeon E5 (and soon Xeon E7) chips from Intel. Those PCI-Express ports have an aggregate of 48GB/sec of I/O bandwidth, significantly more than the 20GB/sec that the Power7 and Power7+ chips offered with the combination of the GX++ bus and I/O bridge chip that was used to implement PCI-Express 2.0 slots.
These integrated PCI-Express 3.0 controllers on the Power8 die provide the transport layer for what IBM is calling the Coherent Accelerator Processor Interface, or CAPI. This interface will allow accelerators plugged into the PCI bus of a system - possibly GPU coprocessors or field programmable gate arrays - to easily access data and follow pointers in main memory just like processors themselves do. This is going to be very handy, and has a good chance of getting Big Blue back into the supercomputer racket in a way that didn't happen with the Power7-based beast formerly known as "Blue Waters".
Depending on the workload, a Power8 chip will yield somewhere around 2.5 times the performance of a baseline Power7+ chip. Again, we presume those are comparisons for chips running at 4GHz.
IBM will offer memory cards with 32GB, 64GB, and 128GB capacities, will have a variety of chip packaging options and will use the Power8 chip across a full line of machines, William Starke, the SMP architect for the Power processors, told El Reg. IBM is not being precise about when the Power8 will come to market, with rumours ranging from late 2014 to early 2015, but Starke said those rumours were wrong and that mid-2014 is a better timeline for system launches using the Power8 chips.
IBM was showing off a part, has systems of all sizes up and running in its labs using the Power8 chips, and has been designing the Power9 processor for quite a while already, according to Starke.
Big Blue is not ready to give up and let Intel have it all. Not just yet, and maybe not ever as long as customers keep buying its mainframes and Power Systems to do big jobs. ®