ARM Holdings wants chip makers to bring more cores and cache to bear as they craft server chips based on its Cortex family of system-on-chip (SoC) designs, and to that end the company is boosting the on-chip caching and main memory controllers of its current ARMv7 and future ARMv8 designs to make them better able to compete against x86 systems.
The new CoreLink CCN-504 that ARM is showing off today at the Linley Tech Processor Conference in San Jose is a cache coherent network that lashes up to four quad-core Cortex-A15 processors and from 8MB to 16MB of L3 cache into a fully coherent, single system image. This network is at the very heart of the SoC and is what links the cores to each other, to the cache, to main memory controllers, and to peripheral controllers that are not resident on the processors.
In many cases, chip makers will take the updated CoreLink cache network and use it to link to controllers that they put on the die, and there is even a chance that some vendors will use the cache coherent network to link ARM processor cores and GPU or other kinds of coprocessors into a single complex that does hybrid ceepie-geepie computing on a single die.
This heterogeneous computing with CPUs and GPUs sharing L3 cache reduces the need to pass data from CPU to GPU and back again and to go out to main memory, which saves both time and energy in the supercomputing uses that ARM Holdings and its enthusiasts see in their future. This hybrid approach, linking CPUs and GPUs through the L3 cache network, is called big.LITTLE by ARM, and as you might expect, the first thing we need to do is give that feature a proper name that doesn't look like a ransom note.
ARM envisions that companies will be interested in plunking DSPs and other kinds of accelerators onto a SoC, and one interesting possibility might be the Epiphany line of RISC coprocessors created by Adapteva. Imagine putting two quad-core ARM chips and two of the 64-core Epiphany coprocessors into a single SoC.
ARM's new CoreLink CCN-504 cache coherent network
The updated cache coherency network supports double the cores of the current generation used in the Cortex-A9 chips, and it is compatible with the quad-core Cortex-A15 reference design from ARM Holdings and its derivatives as well as with the impending 64-bit ARMv8 designs that are expected to start rolling out next year from a variety of vendors.
Applied Micro Circuits very much wants to be first to deliver an ARMv8-based server chip with its X-Gene processor, but we'll see. It is not clear that Applied Micro is using CoreLink to lash together cores and caches. At the moment, storage chip and array maker LSI and ARM server chip upstart Calxeda (which just raised $55m in funding in its second round this week) are the first licensees for the new cache circuits.
The cache controller network has a bandwidth of around 1Tb/sec and can run at clock speeds as high as those of the CPU cores. It has a 128-bit bus and an integrated snoop directory to minimize the amount of broadcasting that has to be done across computing elements to keep them coherent across the L3 cache.
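To make the snoop directory idea concrete, here is a minimal sketch in Python of how a directory can cut down on coherence traffic. It is a toy model, not ARM's design: the `SnoopDirectory` class, its method names, and the four-cluster setup are all assumptions made for illustration, and real hardware tracks line states in far more detail.

```python
from collections import defaultdict


class SnoopDirectory:
    """Toy directory (hypothetical): maps a cache-line address to the clusters holding it."""

    def __init__(self, num_clusters):
        self.num_clusters = num_clusters
        self.sharers = defaultdict(set)  # line address -> cluster ids with a copy

    def record_read(self, cluster, addr):
        """Note that `cluster` now holds a copy of the line at `addr`."""
        self.sharers[addr].add(cluster)

    def snoop_targets_on_write(self, cluster, addr):
        """Return the clusters that must be invalidated when `cluster` writes `addr`."""
        targets = self.sharers[addr] - {cluster}
        self.sharers[addr] = {cluster}  # the writer becomes the sole holder
        return targets

    def broadcast_targets(self, cluster):
        """Without a directory, every other cluster would have to be snooped."""
        return set(range(self.num_clusters)) - {cluster}


# Four clusters, as in a four-way quad-core Cortex-A15 setup (assumed for the example).
directory = SnoopDirectory(num_clusters=4)
directory.record_read(0, 0x1000)
directory.record_read(2, 0x1000)

filtered = directory.snoop_targets_on_write(0, 0x1000)
broadcast = directory.broadcast_targets(0)
print(sorted(filtered), sorted(broadcast))
```

In this sketch a write from cluster 0 only has to snoop the one other cluster recorded as holding the line, where a plain broadcast scheme would snoop all three of the others.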
The CCN-504 design also has clock gating on processors and L3 cache segments, allowing you to shut down either piece by piece to save on power. You can, if the workload allows it, completely power down the L3 cache after backing it up to memory and run cores with just their L2 caches. The cache network also supports up to 18 AMBA 4 AXI4 or ACE-Lite peripheral ports in addition to the processor ports and two memory controller ports.
Incidentally, there is also a new memory controller, called the CoreLink DMC-520, that is designed to work with the CCN-504 design. The controller supports DDR3, low-volt DDR3, and DDR4 memory sticks. The DDR4 spec was just published by the JEDEC Solid State Technology Association, and the memory will run at 1.2 volts instead of the 1.5 and 1.35 volts of DDR3 sticks and use memory chips that range in capacity from 2Gb to 16Gb.
DDR4 is not expected to be ramped up until 2015, but should start trickling into systems in 2014. ARM should be on the front end of that transition rather than on the back end if it wants to get some leverage, with DDR4 memory running twice as fast as DDR3 memory, at 3.2GHz, and using less power, too.
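For a rough sense of what that doubling means, the arithmetic can be sketched in a few lines of Python. The assumptions here are ours, not ARM's or JEDEC's: we read the 3.2GHz figure as 3.2 billion transfers per second, take 1.6 billion transfers per second as the DDR3 baseline implied by "twice as fast", and assume a standard 64-bit memory channel. The function name is made up for the example.

```python
def peak_bandwidth_gb_per_s(transfers_per_sec, bus_width_bits=64):
    """Peak bandwidth in GB/s: transfers per second times bytes moved per transfer."""
    return transfers_per_sec * (bus_width_bits // 8) / 1e9


# Assumed data rates: DDR3 at 1.6 billion transfers/sec, DDR4 at 3.2 billion.
ddr3 = peak_bandwidth_gb_per_s(1.6e9)
ddr4 = peak_bandwidth_gb_per_s(3.2e9)
print(f"DDR3: {ddr3:.1f} GB/s, DDR4: {ddr4:.1f} GB/s, ratio {ddr4 / ddr3:.1f}x")
```

Under those assumptions a single channel goes from roughly 12.8GB/sec to 25.6GB/sec of peak bandwidth. These are theoretical peaks per channel, and sustained throughput in a real system will be lower.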
ARM also says that the CCN-504 cache coherent network is just the first in what will eventually be a family of designs, so do not think for a second that this will be the limit of scalability on ARM SoCs aimed at servers. Lead licensees of ARM intellectual property can get the CCN-504 and DMC-520 designs today, and ARM expects partners using the designs to start sampling products next year. ®