Here's why your next network switch, storage box, or 5G gateway may do more Arm than good: E1, N1 data-center CPU cores aim at future kit
First-ever pure-64-bit-only Armv8-A SMT server processor brains may lure chip designers
If your humble Reg hack had a dollar for every Arm server processor pre-launch, launch, and car-crash failure he's witnessed, he'd be able to treat himself to a round of San Francisco's finest avocado on toast and a quad-espresso latte this morning. In other words, about 15 bucks.
And so, here we go again, once more with feeling, once more unto the breach. More Arm-compatible server-tier CPU cores. This time it may be different. This time, Arm has designed them itself.
Today, the SoftBank-owned biz will unveil two proper homegrown Armv8-A 7nm CPU cores said to be suitable for the data center: the beefy N1 aimed at running workloads in the cloud, and the E1, which features simultaneous multi-threading (SMT) for, well, running highly threaded software. The N1 and E1 are like the Cortex-A72 and Cortex-A53 of the Arm server world.
These latest designs can be licensed by vendors and slotted into system-on-chips to act as the brains of so-called internet infrastructure equipment – software-defined networking devices, storage arrays, computer security systems, Internet-of-Things controllers, cloud compute servers, and combinations thereof. The resulting hardware is then packed into cloud giants' warehouses, or typical data centers.
You can thus take this news two ways: either you're a chip designer, cloud hyper-scaler, or equipment maker who's been waiting for something like an E1 or N1 to plonk into a system-on-chip (SoC) for your hardware, or you're an equipment buyer, and that means your next bit of gear might be powered by this tech.
Or you're a cloud user who has no idea nor a care what the underlying architecture is, and could be using E1 or N1 CPU cores in the near future oblivious to the fact. In any case, it's a potential new headache for Intel and AMD, which are also touting data-center processor parts.
A bit of trivia to start
Interestingly enough, the N1 supports 32-bit and 64-bit applications, and 64-bit-only kernels, whereas the E1 is pure 64-bit-only from kernel to user mode. Last year's Cortex-A76 was Arm's first CPU core to only support 64-bit kernel-mode code, while keeping support for 32-bit and 64-bit user-mode code. We're pretty certain the E1 is Arm's first-ever 64-bit-only Armv8-A offering.
It's a long way from the days of Calxeda biting the dust because it couldn't get its 64-bit Arm-compatible server processor out in time. Since then, Broadcom's Vulcan, Qualcomm's Centriq, and AMD's Seattle have fallen by the wayside as also-ran Arm-server chips, although Vulcan lives on to some degree at Cavium as the ThunderX2. On the bright side, Amazon showed up recently with rentable Arm Cortex-A72-based cloud processors, Ampere is said to be shipping its first 32-core Armv8 offering, and Huawei is touting a 64-core rival. So there is life yet in the world of Arm-based data center boxes.
Indeed, Arm has been teasing details of its server processor ambitions publicly for months if not privately for years, as it designed more and more powerful CPU cores for millions of smartphones, tablets, and laptops worldwide.
At some point, after watching some of its architecture licensees flail around with server chip products, Arm had to take a deep breath, and take the plunge, diving deep into the world of servers and infrastructure for the first time, and produce its own data-center-tier processor core designs. And so it did in October by unveiling its Neoverse brand, which at first peddled high-end Cortex-A cores for powering internet infrastructure kit.
Now, Neoverse has grown to include the E1 and N1, ready for system-on-chips in network switches, storage arrays, IoT controllers, and compute boxes that may or may not be near you depending on whether they are deployed on-premises or in the cloud. Now that's a mouthful.
The Neoverse E1
This is an Armv8.2-A 64-bit-only CPU with 32 or 64KB of four-way VIPT L1 instruction cache, 32 or 64KB of four-way VIPT L1 data cache, 64 to 256KB of private L2 cache, cryptographic engines for speeding up encryption, decryption and hashing algorithms in hardware, and a NEON AdvSIMD block. Up to eight E1 CPUs can sit together in a cluster with up to 4MB of L3 cache.
Crucially, it is an out-of-order execution CPU with SMT. That means, as with Intel's Hyper-Threading, the core can effectively attempt to run two software threads simultaneously. It is Arm's second SMT-capable core, the first being December's Cortex-A65AE. When the CPU is running two threads at once, they can each be running at different exception levels from each other, on different operating systems: as far as software is concerned, one physical core functions as two separate CPU cores. Instructions are fetched two at a time, as a 64-bit block, and fed into the core in a fair, round-robin-like manner: one thread is primed with code, then the other, then the first again, and so on.
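For the curious, that alternating fetch can be sketched in a few lines of Python. This is a toy model, not Arm's design: the function name, the thread labels, and the rule that a thread with nothing left to fetch simply gives up its slot are all our assumptions. Only the two-instructions-per-fetch width and the round-robin order come from the description above.

```python
from collections import deque

FETCH_WIDTH = 2  # instructions per fetch block, per the description above


def round_robin_fetch(threads, fetch_width=FETCH_WIDTH):
    """Yield (cycle, thread_id, block) tuples, alternating between threads.

    threads maps a thread id to a list of instruction strings.
    Assumption: a thread that has run dry is skipped without costing
    a cycle, so the remaining thread gets every fetch slot.
    """
    queues = {tid: deque(instrs) for tid, instrs in threads.items()}
    order = deque(queues)  # rotation order of thread ids
    cycle = 0
    while any(queues.values()):
        tid = order[0]
        order.rotate(-1)  # the other thread goes first next time
        if not queues[tid]:
            continue  # nothing to fetch for this thread
        count = min(fetch_width, len(queues[tid]))
        block = [queues[tid].popleft() for _ in range(count)]
        yield cycle, tid, block
        cycle += 1


if __name__ == "__main__":
    demo = {"T0": ["a0", "a1", "a2", "a3"], "T1": ["b0", "b1"]}
    for cycle, tid, block in round_robin_fetch(demo):
        print(cycle, tid, block)
```

Fed four instructions on T0 and two on T1, the sketch hands out fetch slots in the order T0, T1, T0, which is the fairness property the hardware is said to aim for.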
These instructions flow through a 10-stage integer pipeline, or 12 stages for floating point: five instruction-fetch stages, the final two overlapping with the first two decode-rename-dispatch stages, the final stage of which feeds into one of three issue queues, where integer, floating-point or memory access operations are taken care of.
An E1 with 32KB of L1 cache and 128KB of L2, clocked at 2.5GHz, takes up 0.46mm² of die space and consumes 183mW, we're told, presumably at 7nm. The E1 is what Arm had codenamed Helios, and is not to be confused with the yet-to-be-announced Zeus, a 7nm+ design due to land in silicon in 2020. The E1 is set to arrive this year in 7nm SoCs, perhaps as early as Q1 2019.
The CPU core is ultimately aimed at handling a lot of software threads at once: a suggested application is powering Wi-Fi and 5G communications kit, or software-defined network switches and firewalls. A 16-core, 32-thread E1 SoC for network edge controllers can, it is estimated, draw less than 15W and shift 256Gb/s in software with two channels of 72-bit DDR4-3200 RAM. Each of the SoC's pair of eight-core clusters is connected via Arm's mesh-like CMN-600 interconnect.
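To put those figures in per-core terms, here is a back-of-the-envelope sketch in Python. The constant and function names are ours, and two things are assumptions rather than Arm statements: that each 72-bit DDR4 channel carries 64 data bits plus 8 bits of ECC, and that the whole 15W budget can be divided evenly across cores (in practice the interconnect, memory controllers, and I/O take a share).

```python
CORES = 16
THREADS_PER_CORE = 2        # two-way SMT
THROUGHPUT_GBPS = 256       # estimated software packet throughput
POWER_BUDGET_W = 15         # "less than 15W" for the whole SoC
DDR_CHANNELS = 2
DDR_MEGATRANSFERS = 3200    # DDR4-3200: 3,200 million transfers per second
DATA_BITS_PER_CHANNEL = 64  # assumption: 72 bits = 64 data + 8 ECC


def edge_soc_budget():
    """Return rough per-core and per-thread figures for the example SoC."""
    threads = CORES * THREADS_PER_CORE
    bytes_per_transfer = DATA_BITS_PER_CHANNEL // 8
    peak_mem_gb_per_s = DDR_CHANNELS * DDR_MEGATRANSFERS * bytes_per_transfer / 1000
    return {
        "threads": threads,
        "gbps_per_core": THROUGHPUT_GBPS / CORES,
        "gbps_per_thread": THROUGHPUT_GBPS / threads,
        # Upper bound only: treats the SoC budget as if cores used all of it.
        "watts_per_core_ceiling": POWER_BUDGET_W / CORES,
        "peak_mem_gbytes_per_s": peak_mem_gb_per_s,
    }


if __name__ == "__main__":
    for name, value in edge_soc_budget().items():
        print(f"{name}: {value}")
```

On those assumptions, each core would need to shift about 16Gb/s, or 8Gb/s per thread, inside a ceiling of just under a watt per core, with a theoretical peak of 51.2GB/s of memory bandwidth behind it.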
A 16 or 32-core edge aggregation SoC should be able to lob around multi-100Gbps connections, and an 8 or 16-core wireless gateway should be able to handle 10 to 25Gbps of connectivity, according to Arm.
The Neoverse N1
If the E1 is designed for nimble, multi-threaded software, the N1 is supposed to be the cloud workhorse, running more traditional workloads. The N1 is what Arm previously codenamed Ares, a 7nm design due to appear in chips in 2019, perhaps as early as this quarter, just like its E1 sibling.
The N1 is an Armv8.2-A CPU with 64KB of L1 instruction cache, 64KB of L1 data cache, 512KB or 1MB of private L2 cache, cryptographic engines, and a NEON AdvSIMD block. It is Arm's first core to offer system-wide instruction cache coherency. Its 11-stage pipeline is described as an accordion in that it can collapse to fewer stages depending on the instruction executing. Its four-stage four-wide instruction fetch and three-stage four-wide decode-rename stages feed into five issue queues that carry out branches, integer math (there are two of these queues), floating-point and vector operations, and memory address generation and data loading and storing.
It can run between 2.6GHz and 3.1GHz on a 750mV to 1V supply, consuming 1W to 1.8W, at 7nm, although these are guidelines and aren't final numbers: it really depends on how the system-on-chip designer deploys the cores. The core takes up 1.2 to 1.4mm² depending on whether 512KB or 1MB of L2 is used.
The N1 can scale up to 128 or more cores: the cores are paired up in clusters, and the clusters are connected together in a mesh. Whoever makes these system-on-chips will have to resort to chiplets to get this many cores working. Each chiplet holds a bunch of cores, say 32, and then, for example, four chiplets are packaged together into one processor unit, providing 128 CPU cores in total.
The idea is to throw 64 to 128 cores at hyper-scale cloud workloads, with each SoC drawing 150 or more watts; 16 to 64 at network-edge appliances, with each SoC consuming 35 to 105W; and eight to 32 cores at combined network, storage, and security data-center boxes, with the chips chewing on 25 to 65W.
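A rough sanity check on those numbers, sketched in Python. The helper names are hypothetical, the 32-cores-per-chiplet figure is only the example given above, and the per-core range is Arm's 1W to 1.8W guideline rather than a final figure. The result covers the CPU cores alone, so whatever is left of the SoC budget has to feed caches, the mesh, memory controllers, and I/O.

```python
N1_CORE_WATTS = (1.0, 1.8)  # guideline per-core range at 7nm, not final numbers


def core_power_range(cores, per_core=N1_CORE_WATTS):
    """Return (low, high) watts drawn by the CPU cores alone."""
    low, high = per_core
    return cores * low, cores * high


def chiplets_needed(total_cores, cores_per_chiplet=32):
    """Return how many chiplets it takes to reach total_cores.

    cores_per_chiplet defaults to the example figure of 32; real
    designs are free to pick something else.
    """
    return -(-total_cores // cores_per_chiplet)  # ceiling division


if __name__ == "__main__":
    for cores in (64, 128):
        low, high = core_power_range(cores)
        print(f"{cores} cores: {low:.1f}W to {high:.1f}W, "
              f"{chiplets_needed(cores)} chiplets of 32")
```

By this arithmetic, 64 cores would account for roughly 64W to 115W and 128 cores for roughly 128W to 230W, which squares with hyper-scale parts drawing 150 or more watts apiece.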
As with Arm's big.LITTLE designs for smartphones, fondleslabs and notebooks, SoC designers can combine a mix of E1 and N1 cores in a processor package to suit the target workload: the N1 cores can handle application-level code while the E1 cores scream away throwing data down the network. A high-performance packet-munching box could use 32 cores, 64 threads of E1 CPUs, and eight N1 cores, held together with a CMN-600 mesh and multi-port 100G Ethernet interfaces. The E1 cores would handle the data flowing through the kit, and the N1 cores would handle the control plane.
More to follow
As you'd expect for Arm, the emphasis of Neoverse is on doing more with less power and die area than its rivals. We'll know exactly how much energy the chips consume, and their final speeds and feeds, when they ship in silicon laden with peripherals and controller circuitry.
So far, this is a summary of the two core types: you can find more on our sister site, The Next Platform. There's also the official word on Arm's website, which should be updated with some more information by the time you read this. ®