El Reg's essential guide to deploying LLMs in production

Running GenAI models is easy. Scaling them to thousands of users, not so much

Hands On You can spin up a chatbot with Llama.cpp or Ollama in minutes, but scaling large language models to handle real workloads – think multiple users, uptime guarantees, and not blowing your GPU budget – is a very different beast.

While a model might run efficiently on your PC with less than 4GB of memory, the resources required to serve a model at scale are quite different. Deploying it in a production environment to handle numerous concurrent requests can require 40GB of GPU memory or more.

In this hands-on guide, we'll take a closer look at the avenues for scaling your AI workloads from local proofs of concept to production-ready deployments, and walk you through the process of deploying models like Gemma 3 or Llama 3.1 at scale.

Building with APIs

There are plenty of ways to integrate LLMs into your code base, but when it comes to deploying them in production, we strongly recommend using an OpenAI-compatible API.

This gives you flexibility to keep up with the rapidly changing model landscape. Models considered state of the art just six months ago are already ancient history. And since ChatGPT kicked off the AI boom in 2022, OpenAI's API interface has become the de facto standard for connecting apps to LLMs.

This approach also means you can start building applications that leverage LLMs using whatever's available. For example, you can start building with Mistral 7B in Llama.cpp on your notebook and then swap it out for Mistral AI's API servers when you're ready to deploy it in production. You're not stuck with one model, inference engine, or API provider.
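
To illustrate, here's a minimal sketch using the official openai Python client. The base URL, API key, and model name are placeholders: point them at whatever OpenAI-compatible server you're using today, whether that's a local Llama.cpp or vLLM instance or a hosted endpoint, and swapping providers later is just a matter of changing those values.

from openai import OpenAI

# Placeholders: swap these for your server's address, key, and advertised model name
client = OpenAI(
    base_url="http://localhost:8000/v1",  # e.g. a local Llama.cpp or vLLM server
    api_key="YOUR_API_KEY_HERE",
)

response = client.chat.completions.create(
    model="mistral-7b-instruct",  # whatever model name the server advertises
    messages=[{"role": "user", "content": "Summarize this article in one sentence."}],
)
print(response.choices[0].message.content)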

Speaking of cloud-based inference services, this is usually the most capex-friendly way to scale up your AI deployments. There's no hardware to manage and no models to configure, just an API to point your app at.

In addition to the major model builders' API offerings, a growing pack of AI infra startups now offer inference-as-a-service for open-weight models.

However, not all of these providers are created equal. Some, like SambaNova, Cerebras, and Groq, use boutique hardware or techniques like speculative decoding to speed up inference, but offer a smaller catalog of models.

Others, like Fireworks AI, support deploying custom fine-tuned models using Low Rank Adaptation (LoRA) adapters — a concept we've explored in depth previously. Given how diverse the AI ecosystem has become over the past two years, you'll want to do your research before committing to one provider over another.

So you want (or have) to do it yourself

If cloud-based approaches are off the table — whether it's for privacy or regulatory reasons or because your company already ordered a bunch of GPU servers — then you'll be looking at deploying on-prem. In that case, things can get tricky in a hurry. Let's start with some of the more common questions you're bound to run into.

What model should I use? This is going to depend on your use case. A model used primarily for a customer service chatbot is going to have different requirements than one used for retrieval augmented generation, or as a code assistant.

If you haven't already settled on a model, it's worth spending some time with API providers until you're confident you've found a model that will meet your needs.

How much hardware do I really need? This is a big question. GPUs are expensive and still in short supply. Thankfully, your model can tell you an awful lot about what it's going to take to run it.

Bigger models obviously need more hardware. But how much more? Well, you can get a rough estimate of the minimum GPU memory required by multiplying the parameter count (in billions) by 2GB.

For example, if you want to run a model like Llama 4 Maverick, you'll need a rough minimum of 800GB of GPU memory just to hold the weights. So you're likely looking at Nvidia HGX H200, B200, or AMD MI300X boxes from the get-go. But if you're already getting good results with something like Phi 4 14B, then you may be able to get away with a pair of Nvidia L40S cards.

Note that this only applies to models trained at 16-bit precision. For 8-bit models, like DeepSeek-V3 or R1, you'll need just 1GB per billion parameters. And if you're willing to use model compression techniques like quantization, which trade quality for size, you can get this down to 512MB per billion parameters.
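
If you'd rather not do the mental math, here's that back-of-the-envelope calculation as a quick Python sketch. The parameter counts and precisions are illustrative, and remember this is only the memory needed to hold the weights, not the total you'll need in production.

# Rough lower bound on GPU memory needed just to hold the weights
# bytes_per_param: 2 for 16-bit, 1 for 8-bit, 0.5 for 4-bit quantization
def min_weight_memory_gb(params_billions, bytes_per_param=2.0):
    return params_billions * bytes_per_param

print(min_weight_memory_gb(400, 2.0))  # a 400B-parameter model at 16-bit: ~800GB
print(min_weight_memory_gb(14, 2.0))   # Phi 4 14B at 16-bit: ~28GB
print(min_weight_memory_gb(671, 1.0))  # DeepSeek R1 (671B) at 8-bit: ~671GB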

However, this is only a lower limit. In most cases, you'll need a fair bit more memory if you plan to serve that model to more than one person at a time. That's because in addition to the model's weights, you also have to account for the key-value cache. You can think of it like the model's short-term memory, and the more users and details it needs to keep track of, the more HBM or GDDR you'll need to budget for.

Nvidia's support matrix is a good place to start, as it offers some general guidance on which and how many GPUs you'll need to run many of the more popular open models.

Once you've sized your hardware to your model, you'll also need to think about redundancy. You can certainly configure a single GPU node to run LLMs for the business, but what happens if it fails? So when sizing your deployment, you'll likely be looking at two or more systems for failover and load balancing.

How should I deploy my model? There are several ways to deploy and serve LLMs in production. They can run on bare metal with old-fashioned load balancers distributing the load across them. You can deploy them in virtual machines, or you can spin up containers in Docker or Kubernetes and manage everything that way.

In this guide, we'll be looking at deploying LLMs using Kubernetes, as it neatly abstracts away much of the complexity associated with large scale deployments by automating container creation, networking, and load balancing.

Here's an example of how you might distribute an inference workload across three GPU nodes to spread the load and mitigate outages or downtime across the cluster.


That's not to say Kubernetes is easy. It's a whole can of worms unto itself, but it's one that many enterprises have already adopted and understand fairly well.

This is one of the reasons we've seen Nvidia, Hugging Face, and others gravitate toward containerized environments with their respective Nvidia Inference Microservices (NIMs) and adorably named HUGS (Hugging Face Generative AI Services), which have been preconfigured for common workloads and deployments. We'll take a look at how to use these a little later in this piece.

Which inference engine should I use? There's no shortage of inference engines for running models. We frequently use Ollama and Llama.cpp in our coverage as they'll work on just about anything, even a Raspberry Pi.

However, if your goal is to serve models at scale, then we tend to see libraries like vLLM, TensorRT LLM, SGLang and even PyTorch used more often.

For the purposes of this tutorial, we'll be looking at deploying models using vLLM, as it supports a wide selection of popular models and offers broad support and compatibility across Nvidia, AMD, and other hardware. However, if you prefer to use something like SGLang or TensorRT LLM, most of the principles described here still apply.

Preparing our Kubernetes environment

Compared to your typical Kubernetes environment, getting pods to talk to your GPUs requires some additional drivers and dependencies. The process for setting this up is going to look different for AMD hardware than it would for Nvidia.

To keep things simple, we'll be deploying K3S in a single-node configuration. The basic steps are largely the same for multi-node environments; you'll just need to ensure the dependencies are satisfied on each GPU worker node, and you may need to tweak the storage configuration.

In this guide, we'll be deploying vLLM on a simplified single-node Kubernetes cluster

Note: This is not intended to be an exhaustive guide on container orchestration and assumes some familiarity with Kubernetes concepts. Our goal is to provide a solid foundation for deploying inference workloads in a production-friendly manner. So while we will be walking through the steps to configure a GPU-accelerated Kubernetes node, your environment may look slightly different.

Prerequisites

For this guide you'll need:

  • A server or workstation with at least one supported AMD or Nvidia GPU board.
  • A fresh install of Ubuntu 24.04 LTS in order to keep dependencies manageable.

Nvidia dependencies

Setting up an Nvidia-accelerated K3S environment is relatively straightforward, but does involve a handful of dependencies.

We'll start by installing the CUDA drivers fabric manager and headless server drivers. In our case, the latest release available in the LTS repos is 570, so we'll go with that. We'll also install Nvidia's server utilities, as they're helpful for debugging any driver issues that crop up.

sudo apt install -y cuda-drivers-fabricmanager-570 nvidia-headless-570-server nvidia-utils-570-server

Next we'll add Nvidia's container runtime repo and pull down the latest version by running the following commands and then rebooting.

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-runtime
sudo reboot

Spinning up K3S

With most of our dependencies set, we can move on to deploying K3S. For a single node deployment, this can be achieved by running:

curl -sfL https://get.k3s.io | sh -

Once the process completes, we can check whether our node is up and ready by running:

k3s kubectl get nodes

If everything worked correctly, you should see something like this:

NAME           STATUS   ROLES                  AGE     VERSION
kube-nv-prod   Ready    control-plane,master   3h45m   v1.32.3+k3s1

If you're running an AMD Instinct box, you can move on to the next section, but if you're using Nvidia there are a few more tweaks we need to make.

By default, K3S will detect that the Nvidia container runtime is installed and configure itself accordingly. However, it's best to double-check by grepping the containerd config. If everything worked properly, you should see nvidia-container-runtime in the output of the first command below.

sudo grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
sudo systemctl restart k3s

Deploying the GPU device plugin

With K3S set up, we just need to give Kubernetes permission to allocate GPU resources to our pods. To do this, we need to deploy the respective AMD or Nvidia device plugin DaemonSet to the cluster.

For Nvidia, this can be achieved by running:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml

AMD users, meanwhile, can install the device plugin and node labeller by running the following two commands:

kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-labeller.yaml

Note: All the commands we're showing here can be run directly on the server, but if you prefer to access your cluster remotely via kubectl, you can find the K3S kubeconfig file at /etc/rancher/k3s/k3s.yaml

After about a minute, we can check that our GPU has been detected by running:

kubectl describe node

If everything worked correctly, you should now see amd.com/gpu: 1 or nvidia.com/gpu: 1 under the Allocatable header.

With that, your GPU nodes should be ready for deployment and you can move on to Deploying Gemma 3 4B in vLLM (Nvidia or AMD) or Getting started with NIMs (Nvidia).

Deploying Gemma 3 4B in vLLM

To get started, we'll be looking at how to deploy Gemma 3 4B using vLLM on an Nvidia L40S or AMD MI210-class part, as well as how you'd tweak your deployment to run on a larger eight-GPU box.

To get started, let's take a look at the manifest file.

vllm-openai-gemma.yaml (Nvidia)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-openai
  labels:
    app: vllm-openai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-openai
  template:
    metadata:
      labels:
        app: vllm-openai
    spec:
      runtimeClassName: nvidia
      hostIPC: true
      containers:
      - name: vllm-openai
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "google/gemma-3-4b-it"
          - "--served-model-name"
          - "Gemma 3 4B"
          - "--disable-log-requests"
          - "-tp"
          - "1"
          - "--max-num-seqs"
          - "8"
            #- "--max_num_batched_tokens"
            #- "16000"
          - "--max_model_len"
          - "16000"
          - "--api-key"
          - "$(API_KEY)"
        ports:
          - containerPort: 8000
        env:
          - name: API_KEY
            valueFrom:
              secretKeyRef:
                name: vllm-api-key
                key: API_KEY
          - name: HUGGING_FACE_HUB_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-token
                key: HUGGING_FACE_HUB_TOKEN
        volumeMounts:
          - name: huggingface-cache
            mountPath: /root/.cache/huggingface
          - name: shm
            mountPath: /dev/shm
        resources:
          limits:
            nvidia.com/gpu: 1
            cpu: "12"
            memory: 20G
          requests:
            nvidia.com/gpu: 1
            cpu: "8"
            memory: 10G
      volumes:
        - name: huggingface-cache
          hostPath:
            path: /root/.cache/huggingface
            type: Directory
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "10Gi"

As manifests go, this one is pretty standard. It spins up a single pod running the Gemma 3 4B model in vLLM on a single Nvidia GPU — we've included an AMD-specific manifest below, but the same basic concepts apply to both.

With that said, there are a few elements — other than the vLLM arguments, which we'll get to in a bit — worth highlighting.

Replicas: The number of replicas dictates how many instances of the model we want to deploy. If you've got enough GPUs, you could have several instances of vLLM load balanced across a single server. For smaller models, like Gemma 3 4B, this is actually what AMD recommends.

Resources: Defines how much CPU, GPU, and system memory should be allocated to each pod. These need to be adjusted based on the size of your model.

If your LLM can't fit on a single GPU, you'll want to increase the number of GPUs it should request. If, for example, you wanted to run DeepSeek R1 on an H200 or MI300X box with eight GPUs, you'd set both the GPU limit and request to 8.

vLLM configuration

Now, let's move on to the vLLM config itself. As you can see from the manifest, we've passed quite a few arguments to vLLM describing how we want to run the model. Some of these are required, like setting the model location — in this case it's pulling from Hugging Face — as well as the advertised model name, the flag to disable request logging, and the API key, which we'll pass along via a Kubernetes secret in a bit.

The rest will depend heavily on your specific hardware and use case. So let's dig into each.

tensor-parallel-size: Tensor parallelism enables us to distribute both the model weights and computational load across multiple GPUs.

This number should generally match the number of GPUs under the resources section discussed above. In our case, since we're demonstrating on a single-GPU node, this will be set to 1, which means it's effectively disabled.

max-num-seqs: Sets the upper limit for how many prompts or requests vLLM should process in a single batch. So if you set this to 8, vLLM will process at most eight requests concurrently.

Setting this to a higher number can be more computationally efficient, but it comes with the tradeoff of higher memory consumption and longer waits before responses start streaming in.

So, if you know your LLM will be mostly handling a handful of requests at any given moment, it can actually be better to set this lower to reduce latency and memory consumption.

max_num_batched_tokens: Behaves similarly to --max-num-seqs, but rather than capping concurrent requests, it caps the number of tokens vLLM will process at any one time.

The proper setting here will depend on whether you're optimizing for latency or throughput.

A smaller batch size puts more computational load on the system but also reduces memory consumption and latency. A bigger batch size, meanwhile, allows more tokens to be processed at once, allowing it to serve more users, but potentially increasing their wait time.

If you're unsure how to set this, you can actually omit it from the config and vLLM will automatically adjust the batch size based on the available resources.

max_model_len: This defines the maximum number of tokens each sequence should keep track of, and goes hand in hand with the --max_num_seqs we set earlier.

You can think of this a bit like a workbench. The --max_model_len describes how big each workbench is — or rather can be — and the --max_num_seqs describes how many workbenches are available for folks to use at any given moment. For a given space (memory), you can either have fewer, bigger workbenches or loads of tiny ones.

vLLM will, by default, set this to the maximum context window supported by the model. On older models this was usually small, but these days context windows routinely exceed 128,000 tokens, and some are now pushing a million-plus. Those tokens can take up a lot of memory, so it's often necessary, or even desirable, to set --max_model_len to a smaller value.

The graphic below details how these parameters impact memory requirements:

How max_num_batched_tokens, max_model_len, and max_num_seqs impact GPU memory
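
To put some rough numbers on the workbench analogy, here's a hedged sketch of the worst-case KV cache footprint for a given max_num_seqs and max_model_len. The layer count, KV-head count, and head size below are illustrative placeholders rather than any particular model's real values (pull those from your model's config.json), and actual usage will generally be lower, since vLLM's paged attention only allocates cache blocks as sequences grow.

# Worst-case KV cache estimate: every sequence slot filled to max_model_len
# n_layers, n_kv_heads, and head_dim are placeholders; check your model's config.json
def kv_cache_gb(max_num_seqs, max_model_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # keys + values
    return max_num_seqs * max_model_len * per_token_bytes / 1e9

# A few big summarization workbenches vs lots of small chatbot ones
print(kv_cache_gb(2, 32_000, n_layers=32, n_kv_heads=8, head_dim=128))   # ~8.4GB
print(kv_cache_gb(32, 2_048, n_layers=32, n_kv_heads=8, head_dim=128))   # ~8.6GB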

vllm-openai-gemma.yaml (AMD)

As we mentioned earlier, if you're deploying workloads on AMD Instinct accelerators, the manifest file is going to look a little different.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-openai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-openai
  template:
    metadata:
      labels:
        app: vllm-openai
    spec:
      hostIPC: true
      containers:
      - name: vllm-openai
        image: rocm/vllm:instinct_main
        securityContext:
          seccompProfile:
            type: Unconfined
          capabilities:
            add:
              - SYS_PTRACE
        command: ["/bin/sh", "-c"]
        args:
          - >
            vllm serve google/gemma-3-4b-it \
              --served-model-name 'Gemma 3 4B' \
              --disable-log-requests \
              -tp 1 \
              --max-num-seqs 8 \
              --max_model_len 16000 \
              --api-key $(API_KEY)
        ports:
          - containerPort: 8000
        env:
          - name: VLLM_USE_TRITON_FLASH_ATTN
            value: "0"
          - name: API_KEY
            valueFrom:
              secretKeyRef:
                name: vllm-api-key
                key: API_KEY
          - name: HUGGING_FACE_HUB_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-token
                key: HUGGING_FACE_HUB_TOKEN
        volumeMounts:
          - name: huggingface-cache
            mountPath: /root/.cache/huggingface
          - name: shm
            mountPath: /dev/shm
        resources:                 
          limits:
            amd.com/gpu: "1"
            cpu: "12"
            memory: "20Gi"
          requests:
            amd.com/gpu: "1"
            cpu: "8"
            memory: "10Gi"

      volumes:
        - name: huggingface-cache
          hostPath:
            path: /root/.cache/huggingface
            type: Directory
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "10Gi"

Note: This manifest is specific to running vLLM on Instinct accelerators. If you'd hoped to test things out on AMD workstation cards like the W7900, you'll need to build a compatible vLLM container from source by following this guide, and then push it to your container registry of choice.

Tuning vLLM

While Gemma 3 4B is a pretty small model, requiring a little over 8GB of vRAM to hold the model weights at their native BF16 precision, trying to use the manifest we just looked at on a 24GB Nvidia L4 will likely result in an out-of-memory (OOM) error.

If that's the case for you, you'll need to lower --max_num_seqs, --max_model_len, or --max_num_batched_tokens, or some combination of the three.

These parameters can have a big impact on performance and user experience, so let's take a look at two different scenarios to see how you tune them differently.

Scenario 1: Corporate summarization assistant

How to set up a summarization assistant

Let's say you're building a genAI application to help search and summarize large documents for a team of 12. The likelihood that you'll have more than two people running a summarization task simultaneously will be pretty low, so we can set --max_num_seqs to something like two.

However, because these documents are so big, you'll want to set --max_model_len and --max_num_batched_tokens large enough that the document fits within the context window without getting cut off. If your largest document is around 15,000 words, then you might want to set your --max_model_len to 32,000 to give yourself a buffer — remember that tokens not only represent words but punctuation marks too.

Intuitively, you might think that if you have two 32,000-token sequences, you'd want to set your --max_num_batched_tokens to 64,000. However, that's a worst-case scenario, since not every document is going to be 15,000-plus words. For example, if one user summarized a 10,000-word doc and the other a 3,000-word doc, you wouldn't get anywhere close to the cap. Thankfully, vLLM takes a lot of the guesswork out of setting this parameter. If --max_num_batched_tokens is left unset, it'll automatically rightsize itself based on the available memory.
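
If you're unsure how a word count translates into tokens, a rough rule of thumb for English prose is somewhere around 1.3 to 1.5 tokens per word. Here's a quick sketch of that sizing logic; the ratio and the helper itself are assumptions for illustration, so run representative documents through your model's actual tokenizer before settling on a figure.

# Hypothetical sizing helper: words -> estimated tokens -> a max_model_len with headroom
# The 1.4 tokens-per-word ratio is a ballpark; it varies by tokenizer and content
def suggest_max_model_len(max_doc_words, tokens_per_word=1.4, output_budget=2_048):
    needed = int(max_doc_words * tokens_per_word) + output_budget
    for cap in (8_192, 16_000, 32_000, 64_000, 128_000):  # round up to a comfortable bucket
        if needed <= cap:
            return cap
    return needed

print(suggest_max_model_len(15_000))  # ~21,000 prompt tokens -> 32,000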

Scenario 2: Customer service chatbot

For a customer service chatbot, prioritize a larger number of smaller sequences to maximize the number of concurrent users

Now, let's imagine you're building a customer service chatbot to help users learn about your product or overcome common challenges. Your parameters are going to look quite a bit different from the summarization scenario.

For one, the prompts and responses are going to be much shorter, but you'll likely be serving a larger number of concurrent users. In this case it might make sense to have a large --max_num_seqs like 16, 32 or more but a smaller --max_model_len of 1,024 or 2,048.

And again here, we can let vLLM figure out how to set our --max_num_batch_tokens for us.

Benchmarking

Depending on your specific use case, it'll likely be prudent to benchmark a couple of different configurations until you find one that achieves the desired balance of overall throughput, time-to-first-token (how long folks have to wait for the chatbot to start responding), and second-token latency (how quickly the rest of the answer generates).
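
Once your server is up and responding (we'll get there below), one quick-and-dirty way to eyeball time-to-first-token is to hit it with the openai client in streaming mode and time the first chunk. This is a rough sketch rather than a proper load test: the endpoint, API key, and prompt are placeholders, and the model name assumes the served-model-name set in the manifest above.

import time
from openai import OpenAI

# Placeholders: point these at your vLLM server and the model name it advertises
client = OpenAI(base_url="http://YOUR_DOMAIN_NAME_HERE/v1", api_key="TOP_SECRET_KEY_HERE")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="Gemma 3 4B",
    messages=[{"role": "user", "content": "Explain tensor parallelism in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time-to-first-token
        chunks += 1

total = time.perf_counter() - start
if first_token_at is not None:
    ttft = first_token_at - start
    print(f"time to first token: {ttft:.2f}s")
    if total > ttft and chunks > 1:
        print(f"rough decode rate: {(chunks - 1) / (total - ttft):.1f} chunks/sec")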

Final preparations: With our vLLM manifest tuned to our liking, we can move on to spinning up vLLM. But first, we'll need to generate two secrets. One is for our Hugging Face token — Gemma 3 is a gated model, so don't forget to request access to the repo page first — and the second you'll use to access the vLLM API server later.

This can be achieved by running the following two commands, swapping out HUGGING_FACE_TOKEN_HERE and TOP_SECRET_KEY_HERE for your token and key.

kubectl create secret generic hf-token --from-literal=HUGGING_FACE_HUB_TOKEN=HUGGING_FACE_TOKEN_HERE
kubectl create secret generic vllm-api-key --from-literal=API_KEY=TOP_SECRET_KEY_HERE

Finally, you'll want to make sure the Hugging Face cache directory exists:

mkdir -p /root/.cache/huggingface

Note: To keep things simple, we're using bind mounts, but there's nothing stopping you from using NFS, a persistent volume store, or some other storage interface to maintain the cached Safetensor files from Hugging Face. You could also just remove that particular bind mount entirely. Just remember that if you do, vLLM will need to re-download them each time a new pod is spun up.

Spinning up vLLM

With that out of the way, we can spin up Gemma 3 4B in vLLM by running:

kubectl apply -f vllm-openai-gemma.yaml

We can then check the deployment by running a get pods command in kubectl.

kubectl get pods

You should see a vllm-openai-... pod listed as "ContainerCreating." Once it shows as "Running," we can check the vLLM logs to see if there were any problems.

kubectl logs -l app=vllm-openai -f

After a few minutes, you should see:

INFO:     Application startup complete.

Ingress and load balancing

With the server up and running, you can now configure your Kubernetes ingress and load balancer as you would with any other container deployment. How this is done is going to depend on your specific Kubernetes environment, which ingress controller and load balancer you're using, and your security policies.

The basic idea here is that your ingress should expose a single API address, with Kubernetes automatically balancing the load across all available GPU nodes.

If you've been following along and just want to make sure the vLLM server is working as intended, you can spin up a quick-and-dirty ingress service by running the commands below. Note that if you are going to replicate this, your front-end application needs to be running on the same subnet behind your network's DMZ, as it's served over standard HTTP. With those precautions out of the way, let's get into it.

We'll start by creating a new ClusterIP service. To do that, create another YAML file called vllm-openai-svc.yaml containing the following:

apiVersion: v1
kind: Service
metadata:
  name: vllm-openai-svc
  namespace: default
spec:
  type: ClusterIP
  selector:
    app: vllm-openai
  ports:
    - port: 8000
      targetPort: 8000

Then to create the service we'll run:

kubectl apply -f vllm-openai-svc.yaml

Next we'll load our ingress configuration, creating a separate YAML file called vllm-openai-ingressroute.yaml containing the following, replacing YOUR_DOMAIN_NAME_HERE with the domain you plan to use.

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: vllm-openai
  namespace: default
spec:
  entryPoints:
    - web
  routes:
    - match: Host(`YOUR_DOMAIN_NAME_HERE`)
      kind: Rule
      services:
        - name: vllm-openai-svc
          port: 8000
          sticky:
            cookie:
              name: VLLMSESSION
              secure: true
              httpOnly: true

We can then apply it by running:

kubectl apply -f vllm-openai-ingressroute.yaml

You can now update your local DNS server to point the domain you set earlier at your Kubernetes node's IP.

Modifying /etc/hosts

If you don't have access to your local DNS server (usually handled as part of the router config or as a standalone server), you can instead modify your local machine's /etc/hosts file in Linux to point the domain at your Kubernetes node's IP.

sudo nano /etc/hosts

Append the following and then save and exit.

NODE_IP YOUR_DOMAIN_NAME_HERE

Testing it out

With everything configured, we can now check that the server is working by querying the models endpoint:

export VLLM_API_KEY="TOP_SECRET_KEY_HERE"
curl -i http://YOUR_DOMAIN_NAME_HERE/v1/models \
  -H "Authorization: Bearer $VLLM_API_KEY"

If everything worked you should see something along the lines of:

HTTP/1.1 200 OK
Content-Length: 481
Content-Type: application/json
Date: Wed, 16 Apr 2025 20:06:56 GMT
Server: uvicorn
Set-Cookie: VLLMSESSION=48de0c44ee42ce61; Path=/; HttpOnly; Secure

{"object":"list","data":[{"id":"Gemma 3 4B","object":"model","created":1744834016,"owned_by":"vllm","root":"google/gemma-3-4b-it","parent":null,"max_model_len":16000,"permission":[{"id":"modelperm-e17493c83c8047c1b8ce3b082e4c4a61","object":"model_permission","created":1744834016,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

Getting started with NIMs

The sheer number of knobs and levers that need turning and pulling to achieve optimal throughput or latency is one of the reasons why Nvidia, Hugging Face, and others have gravitated toward pre-baked model containers that require little to no configuration to get up and running.

Nvidia calls these Nvidia Inference Microservices, or NIMs for short, while Hugging Face calls its version of these containers HUGS. These microservices aren't free. You can play with NIMs in a dev environment, but if you want to deploy them in production you'll need an AI Enterprise license, which will set you back $4,500/year per GPU or $1/hour per GPU in the cloud.

However, if you're already paying for said license, they're a no-brainer.

We'll go over the basics of deploying a NIM here, but you'll definitely want to check out Nvidia's docs for specifics on how to tune your configuration to best suit your Kubernetes environment.

Grabbing the dependencies

Deploying NIM on our Kubernetes cluster requires a few additional dependencies, namely Helm and the Nvidia GPU operator. To install Helm, we can simply run:

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
    && chmod 700 get_helm.sh \
    && ./get_helm.sh

We can then add the Helm repo for GPU Operator and install it:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update
helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --version=v25.3.0

Adding the NGC Repo and API key

Before you can download any NIMs, you'll need to generate an NGC API key by following the instructions here, and add it to your .bashrc or .zshrc file.

export NGC_API_KEY=NGC_API_KEY_HERE
echo "export NGC_API_KEY=VALUE" >> ~/.bashrc

or

echo "export NGC_API_KEY=VALUE" >> ~/.zsh

Grab the Helm chart and add your customizations

Next we'll download the NIM LLM Helm chart. We're using version 1.7.0 but you can find the latest version here.

helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-1.7.0.tgz --username='$oauthtoken' --password=$NGC_API_KEY

Next we'll add the NGC repo and API key as secrets on our Kubernetes cluster.

kubectl create secret docker-registry ngc-secret --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=$NGC_API_KEY
kubectl create secret generic ngc-api --from-literal=NGC_API_KEY=$NGC_API_KEY

Finally, we'll create a configuration file for our deployment, which in this case is Llama 3 8B Instruct, and save it as custom-values.yaml.

image:
  repository: "nvcr.io/nim/meta/llama3-8b-instruct"
  tag: 1.0.3
model:
  ngcAPISecret: ngc-api
persistence:
  enabled: false
imagePullSecrets:
  - name: ngc-secret

Spin up the NIM

helm install my-nim nim-llm-1.7.0.tgz -f custom-values.yaml

After a few minutes, your NIM pod should show as running when you check kubectl get pods. We can then test it by forwarding the container port to our machine.

kubectl port-forward service/my-nim-nim-llm 8000:http-openai

Then in a separate shell, we can test it by running:

curl -i http://localhost:8000/v1/models

If everything works, you should see something along the lines of:

HTTP/1.1 200 OK
date: Thu, 17 Apr 2025 21:40:52 GMT
server: uvicorn
content-length: 477
content-type: application/json

{"object":"list","data":[{"id":"meta/llama3-8b-instruct","object":"model","created":1744926052,"owned_by":"system","root":"meta/llama3-8b-instruct","parent":null,"permission":[{"id":"modelperm-47170d15fee9430eb42deda48f0f17b0","object":"model_permission","created":1744926052,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}%

Of course, you'll want to configure a proper ingress service as we did earlier for vLLM, but this should at least give you some idea of how NIMs can be deployed in production.

Summing up

Regardless of which route you take to scaling your AI workloads in production, it's important to prioritize flexibility without compromising on resiliency and security.

It's also worth noting that inference, while an essential piece of the AI puzzle, is one of many, and LLMs on their own are only so useful.

As we've previously discussed, building effective AI tools may require multiple technologies and approaches, including fine-tuning, retrieval augmented generation, and, usually, a good bit of data prep.

The Register aims to bring you more on using LLMs and other AI technologies – without the hype – soon. We want to pull back the curtain and show how this stuff really fits together. If you have any burning questions on AI infrastructure, software, or models, we'd love to hear about them in the comments section below. ®

Editor's Note: The Register was provided an RTX 6000 Ada Generation graphics card by Nvidia, an Arc A770 GPU by Intel, and a Radeon Pro W7900 DS by AMD to support stories like this. None of these vendors had any input as to the content of this or other articles.
