Bake an LLM with custom prompts into your app? Sure! Here's how to get started

In Rust, we trust. But in gen-AI to not hallucinate? Eh, that's another story

Hands on Large language models (LLMs) are generally associated with chatbots such as ChatGPT, Copilot, and Gemini, but they're by no means limited to Q&A-style interactions. Increasingly, LLMs are being integrated into everything from IDEs to office productivity suites.

Besides content generation, these models can be used to gauge the sentiment of writing, identify topics in documents, or clean up data sources, given the right training, prompts, and guardrails, of course. As it turns out, baking LLMs for these purposes into your application code to add some language-based analysis isn't all that difficult, thanks to highly extensible inferencing engines such as Llama.cpp and vLLM. These engines take care of loading and parsing a model, and performing inference with it.
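As a taste of how that looks in practice, a task like sentiment analysis can be framed as an ordinary completion request built around a carefully worded prompt. The sketch below is plain Python with no inference engine attached; the template wording is our own illustration, not a canonical recipe:

```python
# A hypothetical prompt template for sentiment classification.
# The wording is illustrative; a real deployment would tune it and
# add guardrails around whatever the model sends back.
SENTIMENT_TEMPLATE = (
    "Classify the sentiment of the following text as "
    "positive, negative, or neutral. Reply with one word.\n\n"
    "Text: {text}\nSentiment:"
)

def build_sentiment_prompt(text: str) -> str:
    """Fill the template with the document to be analyzed."""
    return SENTIMENT_TEMPLATE.format(text=text)

prompt = build_sentiment_prompt(
    "The new release fixed every bug I reported. Fantastic work!"
)
print(prompt)
```

The resulting string is what you would hand to the engine as a completion prompt, then parse the one-word reply.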

In this hands on, aimed at intermediate-level-or-higher developers, we'll be taking a look at a relatively new LLM engine written in Rust called mistral.rs.

This open source code boasts support for a growing number of popular models, and not just those from Mistral, the startup seemingly the inspiration for the project's name. Plus, mistral.rs can be integrated into your projects using Python, Rust, or OpenAI-compatible APIs, making it relatively easy to slot into new or existing projects.

But, before we jump into how to get mistral.rs up and running, or the various ways it can be used to build generative AI into your code, we need to discuss hardware and software requirements.

Hardware and software support

With the right flags, mistral.rs works with Nvidia CUDA or Apple Metal, or can be run directly on your CPU, although performance is going to be much slower on CPU alone. At the time of writing, the platform doesn't support AMD or Intel GPUs just yet.

In this guide, we're going to be looking at deploying mistral.rs on an Ubuntu 22.04 system. The engine does support macOS, but, for the sake of simplicity, we're going to stick with Linux for this one.

We recommend a GPU with a minimum of 8GB of vRAM, or at least 16GB of system memory if running on your CPU — your mileage may vary depending on the model.

Nvidia users will also want to make sure they've got the latest proprietary drivers and CUDA binaries installed before proceeding. You can find more information on setting that up here.

Grabbing our dependencies

Installing mistral.rs is fairly straightforward, and varies slightly depending on your specific use case. Before getting started, let's get the dependencies out of the way.

According to the README, the only packages we need are libssl-dev and pkg-config. However, we found a few extra packages were necessary to complete the installation. Assuming you're running Ubuntu 22.04 like we are, you can install them by executing:

sudo apt install curl wget python3 python3-pip git build-essential libssl-dev pkg-config

Once those are out of the way, we can install and activate Rust by running the Rustup script.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"

Yes, this involves downloading and executing a script right away; if you prefer to inspect the script before it runs, you can fetch it from https://sh.rustup.rs and read it first.

By default, mistral.rs uses Hugging Face to fetch models on our behalf. Because many of these files require you to be logged in before you can download them, we'll need to install the huggingface_hub package by running:

pip install --upgrade huggingface_hub
huggingface-cli login

You'll be prompted to enter your Hugging Face access token, which you can create by visiting https://huggingface.co/settings/tokens.


With our dependencies installed, we can move on to deploying mistral.rs itself. To start, we'll use git to pull down the latest release from GitHub and navigate into the resulting directory:

git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs

Here's where things get a little tricky, depending on how your system is configured or what kind of accelerator you're using. In this case, we'll be looking at CPU (slow) and CUDA (fast)-based inferencing in mistral.rs.

For CPU-based inferencing, we can simply execute:

cargo build --release

Meanwhile, those with Nvidia-based systems will want to run:

cargo build --release --features cuda

This bit could take a few minutes to complete, so you may want to grab a cup of tea or coffee while you wait. After the executable has finished compiling, we can copy it to our working directory:

cp ./target/release/mistralrs-server ./mistralrs_server

Testing out mistral.rs

With mistral.rs installed, we can check that it actually works by running a test model, such as Mistral-7B-Instruct, in interactive mode. Assuming you've got a GPU with around 20GB or more of vRAM, you can just run:

./mistralrs_server -i plain -m mistralai/Mistral-7B-Instruct-v0.3 -a mistral

However, the odds are your GPU doesn't have the memory necessary to run the model at the 16-bit precision it was designed around. At this precision, you need 2GB of memory for every billion parameters, plus additional capacity for the key value cache. And even if you have enough system memory to deploy it on your CPU, you can expect performance to be quite poor as your memory bandwidth will quickly become a bottleneck.
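That rule of thumb is easy to sanity-check with a few lines of arithmetic. The figures below are our own rough estimates, counting the weights only and ignoring the KV cache and runtime overhead, and assuming a 4-bit GGUF quant works out to roughly 4.5 bits per weight once scaling factors are included:

```python
def model_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough memory footprint of a model's weights alone,
    ignoring the KV cache and runtime overhead."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * 1e9 * bytes_per_weight / 1e9

# Mistral-7B (~7.24 billion parameters) at 16-bit precision:
# about 2GB per billion parameters, as described above.
fp16 = model_memory_gb(7.24, 16)

# The same model quantized down to 4 bits (approximated here as
# ~4.5 bits/weight to account for the quantization block scales).
q4 = model_memory_gb(7.24, 4.5)

print(f"fp16 weights: {fp16:.1f} GB, 4-bit weights: {q4:.1f} GB")
```

The fp16 figure lands around 14.5GB, which is why a 16GB card plus KV cache is already a squeeze, while the 4-bit estimate of roughly 4GB is consistent with the ~5.9GB of total usage we observed once the cache and runtime are added.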

Instead, we want to use quantization to shrink the model to a more reasonable size. In mistral.rs there are two ways to go about this. The first is to use in-situ quantization, which downloads the full-sized model and then quantizes it down to the desired size. In this case, we'll be quantizing the model from 16 bits down to 4 bits. We can do this by adding --isq Q4_0 to the previous command like so:

./mistralrs_server -i --isq Q4_0 plain -m mistralai/Mistral-7B-Instruct-v0.3 -a mistral

Note: If mistral.rs crashes before finishing, you probably don't have enough system memory and may need to add a swapfile (we added a 24GB one) to complete the process. You can temporarily add and enable a swapfile, just remember to delete the file after you reboot, by running:

sudo fallocate -l 24G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

Once the model has been quantized, you should be greeted with a chat-style interface where you can start querying the model. You should also notice that the model is using considerably less memory — around 5.9GB in our testing — and performance should be much better.

However, if you'd prefer not to quantize the model on the fly, mistral.rs also supports pre-quantized GGUF and GGML files, for example these ones from Tom "TheBloke" Jobbins on Hugging Face.

The process is fairly similar, but this time we'll need to specify that we're running a GGUF model and set the ID and filename of the LLM we want. In this case, we'll download TheBloke's 4-bit quantized version of Mistral-7B-Instruct.

./mistralrs_server -i gguf --quantized-model-id TheBloke/Mistral-7B-Instruct-v0.2-GGUF --quantized-filename mistral-7b-instruct-v0.2.Q4_0.gguf

Putting the LLM to work

Running an interactive chatbot in a terminal is cool and all, but it isn't all that useful for building AI-enabled apps. Instead, mistral.rs can be integrated into your code using its Rust or Python APIs, or via an OpenAI-compatible HTTP server.

To start, we'll look at tying into the HTTP server, since it's arguably the easiest to use. In this example, we'll be using the same 4-bit quantized Mistral-7B model as in our last example. Note that instead of starting the server in interactive mode, we've replaced the -i flag with a -p and provided the port we want the server to be accessible on.

./mistralrs_server -p 8342 gguf --quantized-model-id TheBloke/Mistral-7B-Instruct-v0.2-GGUF --quantized-filename mistral-7b-instruct-v0.2.Q4_0.gguf

Once the server is up and running, we can access it programmatically in a couple of different ways. The first would be to use curl to pass the instructions we want to give to the model. Here, we're posing the question: "In machine learning, what is a transformer?"

curl http://localhost:8342/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "Mistral-7B-Instruct-v0.2-GGUF",
"prompt": "In machine learning, what is a transformer?"
}'

After a few seconds, the model should spit out a neat block of text formatted in JSON.
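Because the output follows the OpenAI completions schema, extracting the generated text programmatically is straightforward. Here's a minimal sketch using a canned response; the field values are stand-ins we made up, but the layout matches the schema:

```python
import json

# A stand-in for the JSON the server returns. The text here is
# fabricated for illustration; only the structure matters.
raw = '''
{
  "id": "0",
  "object": "text_completion",
  "model": "Mistral-7B-Instruct-v0.2-GGUF",
  "choices": [
    {"index": 0,
     "text": "A transformer is a neural network architecture...",
     "finish_reason": "stop"}
  ]
}
'''

response = json.loads(raw)
# The generated text lives in choices[0].text, same as OpenAI's API.
answer = response["choices"][0]["text"]
print(answer)
```

The same two lines of parsing work on the live output of the curl command above.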

We can also interact with the server using the OpenAI Python library, though you will probably need to install it using pip first:

pip install openai

You can then call the server using a template, such as this one written for completion tasks.

import openai

query = "In machine learning, what is a transformer?" # The prompt we want to pass to the LLM

client = openai.OpenAI(
    base_url="http://localhost:8342/v1", # The address of your server
    api_key = "EMPTY",
)

completion = client.completions.create(
    model="Mistral-7B-Instruct-v0.2-GGUF",
    prompt=query,
    max_tokens=256, # Cap the length of the response
)

print(completion.choices[0].text)

You can find more examples showing how to work with the HTTP server over in the GitHub repo.

Embedding mistral.rs deeper into your projects

While convenient, the HTTP server isn't the only way to integrate mistral.rs into your projects. You can achieve similar results using its Rust or Python APIs.

Here's a basic example from the repo showing how to use the project as a Rust crate – what the Rust world calls a library – to pass a query to Mistral-7B-Instruct and generate a response. Note: We found we had to make a few tweaks to the original example code to get it to run.

use std::sync::Arc;
use std::convert::TryInto;
use tokio::sync::mpsc::channel;

use mistralrs::{
    Constraint, Device, DeviceMapMetadata, GGUFLoaderBuilder, GGUFSpecificConfig, MistralRs,
    MistralRsBuilder, ModelDType, NormalRequest, Request, RequestMessage, Response, SamplingParams,
    SchedulerMethod, TokenSource,
};

fn setup() -> anyhow::Result<Arc<MistralRs>> {
    // Select a Mistral model
    // We do not use any files from HF servers here, and instead load the
    // chat template from the specified file, and the tokenizer and model from a
    // local GGUF file at the path `.`
    // (Arguments follow the crate's v0.1.x API and may need adjusting
    // for newer releases.)
    let loader = GGUFLoaderBuilder::new(
        GGUFSpecificConfig { repeat_last_n: 64 },
        Some("mistral.json".to_string()),
        None,
        ".".to_string(),
        "mistral-7b-instruct-v0.2.Q4_K_M.gguf".to_string(),
    )
    .build();
    // Load, into a Pipeline
    let pipeline = loader.load_model_from_hf(
        None,
        TokenSource::CacheToken,
        &ModelDType::Auto,
        &Device::cuda_if_available(0)?, // Falls back to CPU if no GPU is found
        false,
        DeviceMapMetadata::dummy(),
        None,
    )?;
    // Create the MistralRs, which is a runner
    Ok(MistralRsBuilder::new(pipeline, SchedulerMethod::Fixed(5.try_into().unwrap())).build())
}

fn main() -> anyhow::Result<()> {
    let mistralrs = setup()?;

    let (tx, mut rx) = channel(10_000);
    let request = Request::Normal(NormalRequest {
        messages: RequestMessage::Completion {
            text: "In machine learning, what is a transformer?".to_string(),
            echo_prompt: false,
            best_of: 1,
        },
        sampling_params: SamplingParams::default(),
        response: tx,
        return_logprobs: false,
        is_streaming: false,
        id: 0,
        constraint: Constraint::None,
        suffix: None,
        adapters: None,
    });
    mistralrs.get_sender().blocking_send(request)?;

    let response = rx.blocking_recv().unwrap();
    match response {
        Response::CompletionDone(c) => println!("Text: {}", c.choices[0].text),
        _ => unreachable!(),
    }
    Ok(())
}
If you want to test this out for yourself, start by stepping up out of the current directory, creating a folder for a new Rust project, and entering that directory. We could use cargo new to create the project, which is recommended, but this time we'll do it by hand so you can see the steps.

cd ..
mkdir test_app
cd test_app

Once there, you'll want to copy the mistral.json template from ../ and download the mistral-7b-instruct-v0.2.Q4_K_M.gguf model file from Hugging Face.

Next, we'll create a Cargo.toml file with the dependencies we need to build the app. This file tells the Rust toolchain details about your project. Inside this .toml file, paste the following:

[package]
name = "test_app"
version = "0.1.0"
edition = "2018"

[dependencies]
tokio = { version = "1", features = ["sync"] }
anyhow = "1"
mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", tag = "v0.1.18", features = ["cuda"] }

[[bin]]
name = "main"
path = "main.rs"

Note: You'll want to remove the , features = ["cuda"] part if you aren't using GPU acceleration.

Finally, paste the contents of the demo app above into a file called main.rs.

With these four files, Cargo.toml, main.rs, mistral-7b-instruct-v0.2.Q4_K_M.gguf, and mistral.json, in the same folder, we can test whether it works by running:

cargo run

After about a minute, you should see the answer to our query appear on screen.

Obviously, this is an incredibly rudimentary example, but it illustrates how mistral.rs can be used to integrate LLMs into your Rust apps by incorporating the crate and using its library interface.

If you're interested in using mistral.rs in your Python or Rust projects, we highly recommend checking out its documentation for more information and examples.

We hope to bring you more stories on utilizing LLMs soon, so be sure to let us know what we should explore next in the comments. ®

Editor's Note: Nvidia provided The Register with an RTX 6000 Ada Generation graphics card to support this story and others like it. Nvidia had no input as to the contents of this article.
