Talk about a calculated RISC: If you think you can do a better job than Arm at designing CPUs, now's your chance

RISC-V, accelerator pressure leads to customizable Armv8-M instructions

TechCon Microprocessor designer Arm will allow chipmakers licensing its blueprints to, in certain circumstances, alter the holiest of holy scriptures: its CPU instruction set.

Specifically, the Softbank-owned biz will allow customers to add custom software instructions to Armv8-M CPU cores, free of charge, starting with the Cortex-M33 in early 2020.

These are microcontroller-grade RISC processor cores: think small, low-power brains directing sensors, gizmos, robots, appliances, industrial equipment, and so on. These aren't the beefy general-purpose Armv8-A CPUs running apps and games on your tablet or smartphone. That said, custom instruction support may eventually come to the A-series, though for now this all concerns the Armv8-M range.

"This makes sense for Armv8-M and R profile cores, where the software is intimate with the hardware," Arm senior director Thomas Ensergueix told The Register earlier this month. "Custom instructions are coming to Armv8-M. Maybe in future they'll come to the R class. I'm not excluding the A class, either."

Why is this interesting? Because perhaps rule one on day one, week one, of licensing Arm CPU cores is: you do not fsck around with the instruction set. Arm – seeking to avoid the architectural customization and subsequent fragmentation that tore apart one-time-rival MIPS – will not, generally speaking, let customers extend its collection of instruction sets. For there is one official Arm family; and there is none other but it.

Until now. Arm will let clients' engineers define custom instructions for Armv8-M cores within their own system-on-chips, and let them wire these specialized instructions to circuitry optimized to perform the operations. The resulting processors must still execute standard Arm instructions as expected, of course.

Start your engines

But why? Well, if you have an algorithm that, say, works with neural networks, and it needs to quickly perform a calculation or data manipulation that spans a bunch of instructions, you could define a single new instruction that carries out the whole process quickly in a dedicated engine. Now you can use this one instruction to perform a complex task in hardware.

The goal is to accelerate key algorithms by offloading parts to dedicated custom circuits within the system-on-chip rather than off to a separate accelerator, like a GPU or neural-network-processing ASIC. This in-core acceleration is expected to take fewer clock cycles, and draw less power.

OK, then, how will it work? Essentially, when you want to extend the instruction set of an Armv8-M core, you configure the decode section of the CPU to route your new instruction to your own execution logic rather than to the usual execute stage. The custom instruction should manipulate data in some way: it can't screw around with the internal operation of the core, we understand.

Crucially, your custom logic block is fed a set of data lines and other signals carrying information extracted from the fetched custom instruction. You then write your Verilog, or whatever hardware design language you're using, to implement the instruction, using logic gates and these input data lines as necessary.

Then you provide output signals from your customized engine to the next stage of the CPU, which handles the setting of status flags, writing results to memory or registers, updating the program counter, and so on.
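To make that flow concrete, here is a toy software model of it in Python. It is a sketch under loud assumptions, not Arm's design: the opcode values, the `Core` class, the four-field instruction tuple, and the `popcount_sum` operation are all invented for illustration. The point it tries to show is the division of labor: the decode step routes a custom opcode to custom logic, and the core itself still handles write-back, flags, and the program counter.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical opcode values, chosen only for this sketch.
OP_ADD = 0x01
OP_CUSTOM_POPCOUNT = 0xC0  # pretend slice of instruction space set aside for custom use

MASK32 = 0xFFFFFFFF


@dataclass
class Core:
    regs: List[int] = field(default_factory=lambda: [0] * 8)
    zero_flag: bool = False
    pc: int = 0
    # Custom execution logic keyed by opcode: two operand values in, one result out.
    custom_units: Dict[int, Callable[[int, int], int]] = field(default_factory=dict)

    def execute(self, instr: Tuple[int, int, int, int]) -> None:
        opcode, rd, rn, rm = instr            # "decode": split the instruction into fields
        a, b = self.regs[rn], self.regs[rm]   # operand values handed to the execute logic
        if opcode in self.custom_units:
            result = self.custom_units[opcode](a, b)  # routed to the custom block
        elif opcode == OP_ADD:
            result = a + b                            # standard execute path
        else:
            raise ValueError(f"undefined opcode {opcode:#x}")
        # The core, not the custom block, does write-back, flags, and the PC update.
        result &= MASK32
        self.regs[rd] = result
        self.zero_flag = result == 0
        self.pc += 1


def popcount_sum(a: int, b: int) -> int:
    """Made-up custom operation: total number of set bits in both operands."""
    return bin(a & MASK32).count("1") + bin(b & MASK32).count("1")


core = Core()
core.custom_units[OP_CUSTOM_POPCOUNT] = popcount_sum
core.regs[1], core.regs[2] = 0b1011, 0b0110
core.execute((OP_ADD, 0, 1, 2))               # a standard instruction still works
core.execute((OP_CUSTOM_POPCOUNT, 3, 1, 2))   # the custom instruction
```

Note that `popcount_sum` only transforms data: it never touches `regs`, `zero_flag`, or `pc` directly, mirroring the restriction that custom logic can't interfere with the core's internal operation.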

[Figure: Arm's overview of custom instruction flow, showing where exactly your custom instruction's execution logic fits into the CPU pipeline. Source: Arm]

The CPU core thus takes care of the stage interlocks, write backs, and other mechanisms: you simply – and "simply" is doing a lot of heavy lifting in this sentence – carve out part of the instruction space as your own, hook your custom acceleration engine to the provided data lines and control signals, and voila.

"Customers can go down to the RTL level, just as if they were configuring an FPGA, and insert their magic," said Ensergueix. "The decode stage gives the instruction data to the custom logic, and the pipeline takes the data back, sets the flags, and so on, taking care of all the interlocks and interleaving processes."

By the way, if you need help implementing this specialist logic, such as rolling multiple instructions or micro-operations into one, Arm is not here to help you. You get the data and control lines, and the ability to configure the decode stage, for free with your licensed Armv8-M blueprints; if you struggle with microprocessor design, you'll need to recruit and pay someone else to produce this bespoke in-core hardware for you.

[Figure: Arm's overview of custom instruction abstraction, showing suggested app code interfaces.]

How will Arm prevent this from causing architecture fragmentation, in which manufacturers run amok with incompatible instruction set extensions? It's going to cross its fingers, and hope y'all use software libraries.

Basically, chip makers will be encouraged to come up with libraries and APIs that access their special instructions in a standardized way, and provide these frameworks to developers who then purchase and use the system-on-chips.

So, for instance, we could design a Vulture-9000 Armv8-M-based microcontroller to control a bird-like drone with flapping wings, with special instructions for accelerating the physics calculations that drive the motors. If you wanted to manufacture bird drones that use our chip, you'd order a batch of the components, and we'd provide a set of libraries for controlling the hardware from a high-level language, like C/C++ or Rust.

Within those libraries are routines using the custom instructions. The developer doesn't need to know the exact details: they just write firmware that calls the low-level API that abstracts, or hides, away the custom instructions.
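As a rough illustration of that library idea, here is a Python sketch of the pattern. Everything in it is hypothetical: the `flap_torque` function, its formula, and the `register_accelerator` hook are invented for our imaginary Vulture-9000, and real firmware would more likely be C or Rust wrapping the instruction in an intrinsic. The shape is what matters: callers use one stable function, and the library decides whether the custom instruction or a plain software routine does the work.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical hook: a vendor library would set this when the chip
# actually provides the custom instruction.
_hw_flap_torque: Optional[Callable[[int, int], int]] = None


def _sw_flap_torque(angle: int, load: int) -> int:
    """Portable fallback using only standard operations (made-up formula)."""
    return (angle * load) >> 4


def register_accelerator(fn: Callable[[int, int], int]) -> None:
    """Install the accelerated implementation behind the public API."""
    global _hw_flap_torque
    _hw_flap_torque = fn


def flap_torque(angle: int, load: int) -> int:
    """Public API: firmware calls this and never sees the custom instruction."""
    impl = _hw_flap_torque or _sw_flap_torque
    return impl(angle, load)


# Without the accelerator registered, the software path is used.
software_result = flap_torque(32, 10)

# Stand-in for the custom instruction; it records calls so we can see it was used.
calls: List[Tuple[int, int]] = []


def fake_custom_instruction(angle: int, load: int) -> int:
    calls.append((angle, load))
    return (angle * load) >> 4


register_accelerator(fake_custom_instruction)
hardware_result = flap_torque(32, 10)  # same call site, different engine underneath
```

The call site is identical before and after the accelerator is registered, which is the portability Arm is hoping libraries will deliver.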

That's the plan, anyway. Also, exactly how these special instructions are marketed and documented will be left up to the system-on-chip designers licensing the customizable Armv8-M microcontroller cores.

"Custom instructions are provided by our silicon partners," said Ensergueix. "It's out of our hands: it's the responsibility of the silicon partners to say what built-in acceleration is present. We're not going to standardize and categorize what our partners are doing. There could be dozens or hundreds of implementations and combinations."

But, really, why is Arm doing this? One seemingly obvious answer is the rise of RISC-V, OpenPower, and other open hardware, following in the footsteps of, if anyone can remember it, OpenSparc. RISC-V – a customizable open-source instruction-set specification and ecosystem of freely available CPU cores – is a growing migraine for Arm, forcing it to adjust its business model to see off the upstart rival. One thing RISC-V lets you do is extend the instruction set, as Alibaba has done, which its fans hope won't lead to MIPS-style fragmentation.

With more and more chip makers and engineers considering, or committing to, booting out Arm's technology and instead dropping in low-end customized royalty-free RISC-V cores in their components, Arm hopes to lure said techies back to its fold, offering to let them customize their Armv8-M CPUs as needed.

"It is definitely a competitive space," Ensergueix told us, adding that Arm wanted to allow its customers to implement customized functionality "in a more elegant way" than using coprocessors and external accelerators. "Our customers are asking for it, and we're addressing their needs."

By the way, Apple's Arm-compatible A13 system-on-chip in its latest iPhones features a matrix-math coprocessor, dubbed the AMX, for accelerating machine-learning and augmented reality algorithms. Arm's custom instruction program lets engineers build the same sort of bespoke engines for their own processors.

Another answer is less obvious though just as interesting. A load of organizations and corporations are designing their own custom accelerators, such as Google and its TPU family. These engines need small internal management and maintenance cores to direct the flow of information, schedule work, control buses and interfaces, and such tasks. Perhaps Arm wishes to stay at the center of this space by offering a subset of its CPU cores with the ability to accelerate customer-specified processes, as some kind of silicon surrogates.

For what it's worth, Arm prefers the term "chassis" rather than surrogate, as in, its "CPU is a chassis" for its customer base of processor designers. In other words, Arm would like you to license an Armv8-M core, hollow it out, inject highly customized and optimized accelerator logic into it, keep the pipeline stages and interlocks and other boring but necessary plumbing, and just forget all about that RISC-V stuff.

The custom instruction program will be announced today at the Arm TechCon conference in Silicon Valley. By the time you read this, there may well be more technical information about it all here. ®

In other TechCon news... Arm is shifting the governance of its Mbed OS project to give chip designers a greater say in how the thing is managed and developed.

There's a new version of embedded real-time OS VxWorks available from Wind River, which supports C++17, Boost, Python, Rust, and the usual set of programming languages.

UltraSoC is touting a security block called Bus Sentinel, due out in early 2020, that you can build into your system-on-chips. It polices internal accesses for malicious behavior – such as certain control register reads and writes well after system boot – so they can be blocked.

The CCIX Consortium – a group of chip designers, including Arm, AMD, Xilinx, and Qualcomm, that got together to create a PCIe-based CPU-accelerator interconnect – today announced CCIX Base Specification Revision 1.1 Version 1.0 with support for PCIe 5 and transfer speeds of 32GT/s.
