Want to save the planet from AI? Chuck in an FPGA and ditch the matrix

Watts down, doc: Boffins find machine learning models can function with more modest power requirements

Large language models can be made 50 times more energy efficient with alternative math and custom hardware, claim researchers at the University of California, Santa Cruz.

In a paper titled, "Scalable MatMul-free Language Modeling," authors Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian describe how the energy appetite of artificial intelligence can be moderated by getting rid of matrix multiplication and adding a custom field-programmable gate array (FPGA).

Using FPGAs for machine learning is not new, but the way this particular optimization is achieved is, and that separates it from previous work, such as that by Alemdar et al in 2016.

AI – by which we mean predictive, hallucinating machine learning models – has been terrible for keeping Earth habitable because it uses so much energy, much of which comes from fossil fuel use. The operation of datacenters to provide AI services has increased Microsoft's CO2 emissions by 29.1 percent since 2020, and AI-powered Google searches each use 3.0 Wh, ten times more than traditional Google queries.

Earlier this year, a report from the International Energy Agency [PDF] projected that global data center power consumption will nearly double by 2026, rising from 460TWh in 2022 to just over 800TWh in two years. The hunger for energy to power AI has even reinvigorated interest in nuclear power, because accelerating fossil fuel consumption for the sake of chatbots, bland marketing copy, and on-demand image generation has become politically fraught, if not a potential crime against humanity.

Jason Eshraghian, an assistant professor of electrical and computer engineering at the UC Santa Cruz Baskin School of Engineering and the paper’s lead author, told The Register that the research findings could provide a 50x energy savings with the help of custom FPGA hardware.

"I should note that our FPGA hardware was very unoptimized, too," said Eshraghian. "So there's still a lot of space for improvement."

The prototype is already impressive. A billion-parameter LLM can be run on the custom FPGA with just 13 watts, compared to 700 watts that would have been required using a GPU.

To achieve this, the US-based researchers had to do away with matrix multiplication, a linear algebra technique that is widely used in machine learning and is costly from a computational perspective. Instead of multiplying weights (parameters assigned to link neural network layers) consisting of floating point numbers between 0 and 1, the computer scientists added and subtracted binary {0, 1} or ternary representations {-1, 0, 1}, thus demanding less of their hardware.
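For the curious, the arithmetic trick can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' implementation: the function name `ternary_matvec` and the sample values are our own assumptions. It computes a matrix-vector product for ternary weights using only addition and subtraction.

```python
def ternary_matvec(weights, x):
    """Compute W @ x for W with entries in {-1, 0, 1} without multiplying.

    `weights` is a list of rows; `x` is a list of input activations.
    (Hypothetical helper for illustration only.)
    """
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # weight +1: add the input
            elif w == -1:
                acc -= xi      # weight -1: subtract the input
            # weight 0: the input is ignored entirely
        out.append(acc)
    return out


# Made-up example values
W = [[1, 0, -1],
     [-1, 1, 1]]
x = [0.5, 2.0, -1.5]
y = ternary_matvec(W, x)
```

Each output is just a running sum of inputs that are added, subtracted, or skipped, which is why the hardware has less work to do than with floating-point multiply-accumulate.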

Other researchers over the past few years have explored alternative architectures for neural networks. One of these, BitNet, has shown promise as a way to reduce energy consumption through simpler math. As described in a paper released in February, representing neural network parameters (weights) as {-1, 0, 1} instead of using 16-bit floating point precision can provide high performance with much less computation.
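How do ordinary weights end up as {-1, 0, 1}? One scheme described in the ternary-weight literature scales each weight by the mean absolute value of the matrix, then rounds and clips. The sketch below assumes that approach for illustration; the names `quantize_ternary` and `eps` are ours, and this is not necessarily how the UC Santa Cruz team does it.

```python
def quantize_ternary(weights, eps=1e-8):
    """Map real-valued weights to {-1, 0, 1} via mean-absolute-value scaling.

    Returns the ternary matrix and the scale factor. (Illustrative
    assumption, not code from the paper.)
    """
    flat = [abs(w) for row in weights for w in row]
    scale = sum(flat) / len(flat)  # mean absolute weight

    def round_clip(v):
        # Round to the nearest integer, then clamp into [-1, 1]
        return max(-1, min(1, round(v)))

    q = [[round_clip(w / (scale + eps)) for w in row] for row in weights]
    return q, scale


# Made-up example weights
weights = [[0.9, -0.05, -1.2],
           [0.3, 0.0, 0.6]]
q, scale = quantize_ternary(weights)
```

Small weights collapse to zero and large ones saturate at plus or minus one; the single scale factor is kept so outputs can be rescaled afterwards.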

The work of Eshraghian and his co-authors demonstrates what can be done with this architecture. Sample code has been published to GitHub.

Eshraghian said the use of "ternary weights replaces multiplication with addition and subtraction, which is computationally much cheaper in terms of memory usage and the energy of actual operations undertaken."

That's combined, he said, with the replacement of "self-attention," the backbone of transformer models, with an "overlay" approach.

"In self attention, every element of a matrix interacts with every single other element," he said. "In our approach, one element only interacts with one other element. By default, less computation leads to worse performance. We compensate for this by having a model that evolves over time."

Eshraghian explained that transformer-based LLMs take all text in one hit. "Our model takes each bit of text piece by piece, so our model is tracking where a particular word is situated in a broader context by accounting for time," he said.
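To give a rough feel for the piece-by-piece idea, here is a toy sketch in Python. It assumes a simple gated running state in which each channel only ever mixes with its own previous value; the update rule, the name `run_recurrence`, and the fixed `forget` values are hypothetical and are not the layer described in the paper.

```python
def run_recurrence(tokens, forget):
    """Process a sequence one token at a time with an element-wise state.

    For each channel i: state[i] = forget[i] * state[i] + (1 - forget[i]) * x[i].
    No channel ever interacts with another, unlike all-pairs self-attention.
    (Toy update rule assumed for illustration.)
    """
    state = [0.0] * len(forget)
    history = []
    for x in tokens:
        state = [f * h + (1.0 - f) * xi
                 for f, h, xi in zip(forget, state, x)]
        history.append(state)
    return history


# Three made-up "tokens", each a two-channel vector
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
forget = [0.5, 0.0]   # channel 0 remembers half its past; channel 1 remembers none
states = run_recurrence(tokens, forget)
```

The cost grows linearly with the number of tokens, since each step touches the state once, and context is carried forward through time rather than recomputed across every pair of positions.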

Reliance on ternary representation of data does hinder performance, Eshraghian acknowledged, but he and his co-authors found ways to offset that effect.

"Given the same number of computations, we're performing on par with Meta's open source LLM," he said. "However, our computations are ternary operations, and therefore, much cheaper (in terms of energy/power/latency). For a given amount of memory, we do far better."

Even without the custom FPGA hardware, this approach looks promising. The paper claims that by using fused kernels in the GPU implementation of ternary dense layers, training can be accelerated by 25.6 percent while memory consumption can be reduced by 61 percent compared to a GPU baseline.

"Furthermore, by employing lower-bit optimized CUDA kernels, inference speed is increased by 4.57 times, and memory usage is reduced by a factor of 10 when the model is scaled up to 13B parameters," the paper claims.

"This work goes beyond software-only implementations of lightweight models and shows how scalable, yet lightweight, language models can both reduce computational demands and energy use in the real-world." ®
