How this open source LLM chatbot runner hit the gas on x86, Arm CPUs

Way to whip that LLaMA's ass

A handy open source tool for packaging up LLMs into single universal chatbot executables that are easy to distribute and run has apparently had a 30 to 500 percent CPU performance boost on x86 and Arm systems.

The project is called llamafile, and was created by Justine Tunney with support from Mozilla.

There are a ton of models out there that you can download and experiment with on your own system, as we've previously covered in detail. Ultimately these models are just very large files of numbers that describe neural networks – you need to have software that can open and parse a model, and know how to run input prompts and queries through the neural net to generate output for the user.

One such piece of software is llama.cpp – a plain C++ program developed primarily by Georgi Gerganov. Though llama.cpp set out to support Meta's LLaMA series of models – hence the name – it can also handle a boatload of other LLMs, such as Mistral-7B and Orion-14B.

Inspired by Meta's original Python-based LLaMA driver, llama.cpp is pretty cool in that it has no dependencies, works on Windows, Linux, macOS, and FreeBSD, at least, and can take advantage of hardware acceleration – from Nvidia GPUs to Apple, Intel, and AMD extensions.

You can build and run llama.cpp natively, give it a model to load, and then interact with that LLM in various ways. Where it gets tricky is that the model files involved are usually quite large, and it can be confusing to know which variant is best to use.

And this is where llamafile is useful: it combines a chosen LLM file with llama.cpp to produce a single universal executable that can run on macOS, Windows, Linux, FreeBSD, OpenBSD, and NetBSD, on 64-bit x86 and Arm processors. What makes it possible is the frankly magical Cosmopolitan Libc project, which allows C/C++ code to be built in such a way that a resulting single executable just runs on the aforementioned OSes and CPU architectures.

It massively simplifies distribution, if you just want to give a model to someone to try out. Neat.

Speed boost

Just a few days ago, Tunney blogged in depth about how she implemented 84 new matrix multiplication kernels to boost llamafile's CPU performance during inference by 30 to 500 percent when using models with FP16 or Q8_0 type weights. We're told the "improvements are most dramatic for ARMv8.2+ (eg, RPI 5), Intel (eg, Alderlake), and AVX512 (eg, Zen 4) computers."

Tunney tested the code on a wide range of hardware – from the modest but cheap Raspberry Pi 5 to AMD's 96-core flagship Threadripper Pro 7995WX. In almost every case for both the Mistral 7B and TinyLlama 1.1B models, the improved llamafile (version 0.7) was comfortably ahead of llama.cpp (version 2024-03-26) and leaps ahead of llamafile 0.6.2. To be clear here: The big gains pretty much happen during prompt evaluation, when the LLM is processing input data. During the output (aka evaluation) stage, improvements were less dramatic.

For instance, on an Intel Skylake Core i9-9900, prompt processing jumped 50 percent versus llama.cpp, whereas evaluation stayed the same.

Although the reported performance boosts are for FP16 and Q8_0 data type weights, limited testing for other types also showed big improvements. On the Core i9-9900 using the Q4_0 variant of Mistral 7B, prompt performance was 65 percent higher with llamafile. On the Threadripper Pro 7995WX, performance more than doubled using FP32, again with Mistral 7B.

It wasn't a clean sweep for llamafile 0.7, though. The Apple M2 Ultra-powered Mac Studio saw some regression in both prompt and evaluation performance for the Q8_0 data type. This was apparently because llama.cpp is already optimized on Apple hardware, and Tunney didn't opt for Apple's proprietary compiler.

It even beats Intel's matrix multiplication software

Achieving such impressive performance gains was a multi-step process, which Tunney documented in fine detail. By her estimation, vanilla llama.cpp's performance is 233 gigaFLOPS on her Core i9-9900 PC, and that can be turned up to 384 gigaFLOPS when enabling Intel's Math Kernel Library (MKL).

While the performance gains that come with MKL are great, the fact that it's closed source is less than ideal for this open source effort, according to Tunney. She noted that "integrating foreign BLAS libraries into llama.cpp isn't that practical, due to the way its threading model works." And since MKL is closed source, it's not possible to just look at it and see how it can be improved.

That apparently didn't deter Tunney, who wrote: "I believe the trick with CPU math kernels is exploiting instruction level parallelism with fewer memory references." From there, the developer showed how unrolling two outer loops in llama.cpp resulted in code that can run at 810 gigaFLOPS using OpenMP on an Intel Alderlake i9-14900K with 6400 MT/s RAM. By contrast, the same code run through MKL is only 295 gigaFLOPS.
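To make the idea concrete, here is a minimal C++ sketch of what unrolling the two outer loops of a matrix multiply looks like. It is not Tunney's kernel: the function names, the 2x2 tile size, and the row-major layout are assumptions chosen for illustration, and the kernels described in her blog post are considerably more elaborate. The point is only that the unrolled version keeps four independent accumulators busy and does four multiply-adds per four loads, where the baseline does one multiply-add per two loads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed layout for this sketch: A is m x k and B is n x k, both row-major,
// so C[i][j] is the dot product of row i of A with row j of B.

// Baseline: a single accumulator and two memory loads per multiply-add.
void matmul_naive(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C,
                  std::size_t m, std::size_t n, std::size_t k) {
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t l = 0; l < k; ++l)
                sum += A[i * k + l] * B[j * k + l];
            C[i * n + j] = sum;
        }
    }
}

// Both outer loops unrolled by two: each value loaded from A and B is reused
// twice, and the four accumulators do not depend on one another, so the CPU
// can overlap their multiply-adds.
void matmul_unrolled_2x2(const std::vector<float>& A, const std::vector<float>& B,
                         std::vector<float>& C,
                         std::size_t m, std::size_t n, std::size_t k) {
    assert(m % 2 == 0 && n % 2 == 0);  // sketch only: no edge-tile handling
    for (std::size_t i = 0; i < m; i += 2) {
        for (std::size_t j = 0; j < n; j += 2) {
            float c00 = 0.0f, c01 = 0.0f, c10 = 0.0f, c11 = 0.0f;
            for (std::size_t l = 0; l < k; ++l) {
                const float a0 = A[(i + 0) * k + l];
                const float a1 = A[(i + 1) * k + l];
                const float b0 = B[(j + 0) * k + l];
                const float b1 = B[(j + 1) * k + l];
                c00 += a0 * b0;
                c01 += a0 * b1;
                c10 += a1 * b0;
                c11 += a1 * b1;
            }
            C[(i + 0) * n + (j + 0)] = c00;
            C[(i + 0) * n + (j + 1)] = c01;
            C[(i + 1) * n + (j + 0)] = c10;
            C[(i + 1) * n + (j + 1)] = c11;
        }
    }
}
```

Both functions compute the same result. Any speed difference depends on the compiler and CPU, so the sketch makes no claim about reproducing the gigaFLOPS figures quoted here.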

Compatibility issues meant OpenMP couldn't be used for llamafile, but a custom kernel framework was able to mostly retain performance at 790 gigaFLOPS. That's over twice as fast as the fastest implementation using MKL.
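For anyone wanting to sanity-check figures like these, the following C++ sketch shows the usual back-of-the-envelope arithmetic. It assumes the conventional count of 2·m·n·k floating-point operations for a dense matrix multiply; the article doesn't say exactly how Tunney's numbers were measured, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Conventional FLOP count for an (m x k) by (k x n) matrix multiply:
// each of the m*n outputs needs k multiplies and k adds.
double matmul_gigaflops(std::int64_t m, std::int64_t n, std::int64_t k,
                        double seconds) {
    assert(seconds > 0.0);
    const double flops = 2.0 * static_cast<double>(m) *
                         static_cast<double>(n) * static_cast<double>(k);
    return flops / seconds / 1e9;
}

// Ratio of two throughput figures measured in the same units.
double speedup(double fast_gflops, double slow_gflops) {
    assert(slow_gflops > 0.0);
    return fast_gflops / slow_gflops;
}
```

Plugging in the quoted numbers, `speedup(790.0, 295.0)` comes out at roughly 2.7, which squares with the "over twice as fast" claim.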

While this solution is quick, it doesn't scale well as the problem grows. MKL wins (Tunney didn't say by how much) when complexity is cranked up to 1,024 from 512. However, Tunney suggested that for the time being this isn't a critical issue, since llama.cpp runs smaller problem sizes by default, and she expects to figure out how to optimize for larger sizes eventually.

The optimizations and support for BF16 have been submitted upstream to llama.cpp itself, and the reception seems positive. Gerganov said the merge requests will be reviewed in the coming days. ®
