Arm will today unveil two trinities – a family of three CPU cores, and a family of three GPU cores – for system-on-chips powering future laptops, smartphones, smart home entertainment equipment, and similar gear.
The CPU cores are the Cortex-X2, A710, and A510, which are Armv9 compliant with SVE2 features. In fact, we're told, these are the first Armv9 Cortex-series cores. The Armv9 data-center-grade Neoverse N2 was unveiled in April.
The new GPUs are the Mali-G710, G510, and G310. Gluing together combinations of these CPU and GPU cores on the dies are Arm's CoreLink CI-700 coherent interconnect and the CoreLink NI-700 network-on-chip interconnect, also announced today. Arm has placed these designs under a marketing umbrella called its Total Compute Solutions.
Typically when you get a handheld using CPUs designed by Arm, the cores are arranged in a so-called big.Little cluster: this features a clan of powerful but power-hungry cores, and a set of smaller, less powerful but more battery-friendly cores. The operating system decides which cores to activate to run the user's applications, balancing the need for processing power against battery life.
These clusters tend to use something like the Cortex-A78 for the big cores, and the Cortex-A55 for the lighter cores. Now Arm's touting the Cortex-A510 as the lower-end workhorse, and the A710 and X2 for the larger cores, with the X2 being beefier than the A710.
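To make the big-versus-little trade-off concrete, here is a toy placement heuristic in Python. It is only a sketch of the general idea described above: the `Core` fields, the capacity and power numbers, and the `place_task` function are all invented for illustration, and real operating system schedulers weigh far more factors than this.

```python
from dataclasses import dataclass


@dataclass
class Core:
    name: str
    capacity: int    # relative peak throughput (made-up units)
    power_cost: int  # relative power draw when active (made-up units)
    load: int = 0    # work currently assigned to this core


def place_task(cores, demand):
    """Put a task on the cheapest core that still has room for it.

    If no core has enough spare capacity, fall back to the core
    with the most headroom. Returns the chosen core's name.
    """
    fits = [c for c in cores if c.capacity - c.load >= demand]
    if fits:
        chosen = min(fits, key=lambda c: c.power_cost)
    else:
        chosen = max(cores, key=lambda c: c.capacity - c.load)
    chosen.load += demand
    return chosen.name


# Hypothetical cluster: one big core and two little cores.
cores = [Core("big0", 100, 10), Core("little0", 40, 2), Core("little1", 40, 2)]
light = place_task(cores, 20)   # small job lands on a little core
heavy = place_task(cores, 70)   # demanding job needs the big core
medium = place_task(cores, 30)  # goes to the little core that still has room
```

Light work stays on the frugal cores, and the big core is only woken when nothing smaller can cope, which is the balance between processing power and battery life that the cluster design is meant to allow.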
Potential configurations for future system-on-chips using this technology include four X2s and four A710s for laptops; one X2, three A710s, and four A510s for top-end smartphones; and dual A710s and six A510s for home electronics, such as smart TVs and assistants.
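For a quick sanity check of those three layouts, they can be written down as data. The sketch below uses only the core counts quoted above; the dictionary keys and helper names are our own, and the pairing helper assumes A510 cores are grouped two to a cluster wherever possible.

```python
from collections import Counter

# Core counts per configuration, as quoted in the text above.
CONFIGS = {
    "laptop": Counter({"X2": 4, "A710": 4}),
    "top-end smartphone": Counter({"X2": 1, "A710": 3, "A510": 4}),
    "home electronics": Counter({"A710": 2, "A510": 6}),
}


def total_cores(config):
    """Total number of CPU cores in one configuration."""
    return sum(config.values())


def a510_pairs(config):
    """Fewest A510 clusters needed, assuming two cores per cluster."""
    return -(-config["A510"] // 2)  # ceiling division; missing key counts as 0


totals = {name: total_cores(cfg) for name, cfg in CONFIGS.items()}
```

Each of the three examples works out to eight CPU cores in total; what changes is the mix of large and small ones.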
Using a DynamIQ Shared Unit – which features an L3 cache, control logic, and external interfaces – labeled the DSU-110 with the aforementioned CoreLink interconnects, it should be possible to build a chip with eight Cortex-X2 cores, 16MB of L3 cache, and 32MB of unified system-level cache. Arm reckons this arrangement will exceed last year's "mainstream laptop" peak single-thread performance. The DSU-110, which has a ring-based architecture, was designed for Armv9-A CPU clusters.
As you would expect, Arm claimed its X2, A710, and A510 outperform their predecessors, meaning theoretically faster system-on-chips in future devices. Here's a quick summary of what caught our eye for each of the cores:
The X2 is a follow-up to last year's X1, and is designed for laptops and large-screen flagship smartphones. Arm described the X2 as its "most powerful CPU to date," which is pretty much what it said about its server-grade Neoverse V1 last month, too.
As a high-performance core, the X2 is not ashamed to draw from your battery when it's tapped up by an OS to give a device bursts of extra processing oomph. The CPU is 64-bit-only as part of Arm's long plan to dump 32-bit mode entirely from all of its mobile big.Little processor designs by 2023. It also supports 128-bit-length vectors.
Arm has decoupled the branch prediction unit from the instruction fetcher for the X2, allowing the predictor to run ahead faster, and said it has improved the conditional branch prediction accuracy, which leads to better performance. Speaking of speculative execution, we're also told the trinity of CPU cores have been designed from the start to reduce side-channel leaks a la Spectre.
The X2 has dropped a dispatch pipeline stage, and is described by Arm as having an "overall ten-cycle pipeline." This out-of-order execution core has a reorder buffer of more than 288 entries. It can feed eight macro-ops at a time from the decode-rename-commit stages into three issue queues: one for an integer unit containing ALUs and branch takers; one for floating-point and SIMD operations; and one for the backend cache and memory access unit, which has been improved to reduce stalls and increase the load-store window.
Arm also disclosed that the X2 can have 512KB or 1MB of L2 cache as required by the customer. An 8MB L3 cache is expected to be configured, too.
Like the X2, the A710 has also benefited from improvements to its branch prediction accuracy, and has an overall ten-cycle pipeline, according to Arm. We're told the A710 makes fewer accesses than its A78 predecessor to its link to the outside world – its DynamIQ Shared Unit – and fewer RAM accesses, which decreases power consumption.
The A710 can have 32 or 64KB of L1 instruction cache, which is parity checked, and 32 or 64KB of L1 data cache that's protected by ECC. There's also 256KB or 512KB of L2 ECC cache. It supports 32-bit and 64-bit mode code.
The A510 is said to be a 64-bit-only, three-wide, in-order design in which two cores can form a cluster that shares a NEON and SVE2 SIMD unit and up to 512KB of L2 cache. Each core has its own 32 or 64KB of L1 instruction cache and same again for the L1 data cache. The SIMD unit can have two 64-bit or two 128-bit pipelines. A single-core cluster is possible, if required. On the front-end, the A510 can fetch up to 16 instruction bytes and decode three instructions per cycle.
A built-in arbiter will be used to share the SIMD unit and its two data paths with the two A510 cores in a cluster, if needed. Arm claims there is "minimal overhead" when both cores use the vector unit. The unit can execute scalar floating-point, NEON, and SVE2 instructions – SVE2 includes SVE.
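Arm hasn't spelled out how that arbiter makes its decisions, so the following Python toy is purely an assumption on our part: each cycle, if both cores have vector work queued, each gets one of the two pipelines, and if only one core is busy, it gets both. The function name and the policy are hypothetical, and the model ignores instruction latencies and dependencies entirely.

```python
def cycles_to_drain(ops_a, ops_b, pipelines=2):
    """Count cycles for a shared vector unit to issue all queued ops.

    ops_a and ops_b are the numbers of vector operations waiting
    from each of the two cores. Assumed policy (not Arm's disclosed
    design): split the pipelines evenly when both cores are waiting,
    otherwise give every pipeline to the one core that has work.
    """
    cycles = 0
    while ops_a > 0 or ops_b > 0:
        if ops_a > 0 and ops_b > 0:
            share = pipelines // 2  # one pipeline each in the default case
            ops_a -= min(share, ops_a)
            ops_b -= min(share, ops_b)
        elif ops_a > 0:
            ops_a -= min(pipelines, ops_a)
        else:
            ops_b -= min(pipelines, ops_b)
        cycles += 1
    return cycles


alone = cycles_to_drain(4, 0)      # one core has the unit to itself
contended = cycles_to_drain(4, 4)  # both cores compete for it
```

In this idealised model, a lone core gets through four operations in two cycles, while two cores with four operations each take four cycles between them. Both pipelines stay busy in either case, which is one way to picture what "minimal overhead" might mean.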
By the time you read this, you can find more details about the Cortex cores here from Arm.
The Mali GPUs and interconnects
As their names suggest, the Mali-G710 is Arm's high-end graphics processor aimed at flagship devices, the G310 is the entry-level GPU, and the G510 falls in between.
The G710 is said to have a redesigned 16-lane, dual-data-path execution engine, and double the texture unit capabilities of its predecessor, the Mali-G78, among other improvements. It can also handle machine-learning workloads, and sports seven to 16 shader cores, and two to four L2 slices that can be 256 or 512KB each. The Mali job manager has been replaced with the Command Stream Frontend (CSF), which is said to be better suited to handling Vulkan API features.
There's also a G710 variant called the Mali-G610 that has the CSF, one to six shader cores, and one to four L2 cache slices. For more info on these GPU cores, Arm has technical details here.
Similarly, you can find info on the CoreLink CI-700 and NI-700 interconnects here. Expect to see these CPU, GPU, and glue logic designs appearing in devices next year or so, semiconductor supply chains willing. ®