El Reg's essential guide to deploying LLMs in production
Running GenAI models is easy. Scaling them to thousands of users, not so much
Hands On You can spin up a chatbot with Llama.cpp or Ollama in minutes, but scaling large language models to handle real workloads – think multiple users, uptime guarantees, and not blowing your GPU budget – is a very different beast.
While a model might run efficiently on your PC with less than 4 GB of memory, the resources required to serve a model at scale are quite different. Deploying it in a production environment to handle numerous concurrent requests can require 40GB of GPU memory or more.
In this hands-on guide, we'll be taking a closer look at the avenues for scaling your AI workloads from local proofs of concept to production-ready deployments, and walk you through the process of deploying models like Gemma 3 or Llama 3.1 at scale.
Building with APIs
There are plenty of ways to integrate LLMs into your code base, but when it comes to deploying them in production, we strongly recommend using an OpenAI-compatible API.
This gives you flexibility to keep up with the rapidly changing model landscape. Models considered state of the art just six months ago are already ancient history. And since ChatGPT kicked off the AI boom in 2022, OpenAI's API interface has become the de facto standard for connecting apps to LLMs.
This approach also means you can start building applications that leverage LLMs using whatever's available. For example, you can start building with Mistral 7B in Llama.cpp on your notebook and then swap it out for Mistral AI's API servers when you're ready to deploy it in production. You're not stuck with one model, inference engine, or API provider.
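To make that concrete, here's a minimal sketch of what an OpenAI-compatible request looks like. The base URL, API key, and model name below are placeholders; swapping providers, or moving from a local test server to a hosted one, is usually just a matter of changing those three values while the request itself stays the same.

# Works against any OpenAI-compatible endpoint (Llama.cpp, vLLM, hosted APIs, etc.)
# BASE_URL, the key, and the model name are placeholders - change them to match
# whichever server or provider you're pointing at.
BASE_URL="http://localhost:8000/v1"
API_KEY="YOUR_KEY_HERE"

curl "$BASE_URL/chat/completions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistral-7b-instruct",
        "messages": [{"role": "user", "content": "Summarize this support ticket in one sentence."}]
      }'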
Speaking of cloud-based inference services, this is usually the most capex-friendly way to scale up your AI deployments. There's no hardware to manage and no models to configure, just an API to point your app at.
In addition to the major model builders' API offerings, a growing pack of AI infra startups now offer inference-as-a-service for open-weight models.
However, not all of these providers are created equal. Some, like SambaNova, Cerebras, and Groq, use boutique hardware or techniques like speculative decoding to speed up inference, but offer a smaller catalog of models.
Others, like Fireworks AI, support deploying custom fine-tuned models using Low Rank Adaptation (LoRA) adapters — a concept we've explored in depth previously. Given how diverse the AI ecosystem has become over the past two years, you'll want to do your research before committing to one provider over another.
So you want (or have) to do it yourself
If cloud-based approaches are off the table — whether it's for privacy or regulatory reasons or because your company already ordered a bunch of GPU servers — then you'll be looking at deploying on-prem. In that case, things can get tricky in a hurry. Let's start with some of the more common questions you're bound to run into.
What model should I use? This is going to depend on your use case. A model used primarily for a customer service chatbot is going to have different requirements than one used for retrieval augmented generation, or as a code assistant.
If you haven't already settled on a model, it's worth spending some time with API providers until you're confident you've found a model that will meet your needs.
How much hardware do I really need? This is a big question. GPUs are expensive and still in short supply. Thankfully, your model can tell you an awful lot about what it's going to take to run it.
Bigger models obviously need more hardware. But how much more? Well, you can get a rough estimate of the minimum GPU memory required by multiplying the parameter count (in billions) by 2GB.
For example, if you want to run a model like Llama 4 Maverick, you'll need a rough minimum of 800GB of GPU memory just to hold the weights. So you're likely looking at Nvidia HGX H200, B200, or AMD MI300X boxes from the get-go. But if you're already getting good results with something like Phi 4 14B, then you may be able to get away with a pair of Nvidia L40S cards.
Note that this only applies to models trained at 16-bit precision. For 8-bit models, like DeepSeek-V3 or R1, you'll need just 1GB per billion parameters. And if you're willing to use model compression techniques like quantization, which trade quality for size, you can get this down to 512MB per billion parameters.
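If it helps to see that arithmetic written down, here's the rule of thumb as a trivial shell snippet (the model size is just an example):

# Back-of-the-envelope: GPU memory needed just to hold the weights.
# Roughly 2GB per billion parameters at 16-bit, 1GB at 8-bit, 0.5GB at 4-bit.
params_b=70        # model size in billions of parameters, e.g. Llama 3.1 70B
gb_per_b_params=2  # native BF16/FP16 weights
echo "$(( params_b * gb_per_b_params ))GB for the weights alone"   # prints 140GB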
However, this is only a lower limit. In most cases, you'll need a fair bit more memory if you plan to serve that model to more than one person at a time. That's because in addition to the model's weights, you also have to account for the key-value cache. You can think of it like the model's short-term memory, and the more users and details it needs to keep track of, the more HBM or GDDR you'll need to budget for.
Nvidia's support matrix is a good place to start, as it offers some general guidance on which and how many GPUs you'll need to run many of the more popular open models.
Once you've sized your hardware to your model, you'll also need to think about redundancy. While you can certainly configure a single GPU node to run LLMs for business, what happens if it fails? So when sizing your deployment, you'll likely be looking at two more systems for failover and load balancing.
How should I deploy my model? There are several ways to deploy and serve LLMs in production. They can run on bare metal with old-fashioned load balancers distributing the load across them. You can deploy them in virtual machines, or you can spin up containers in Docker or Kubernetes and manage everything that way.
In this guide, we'll be looking at deploying LLMs using Kubernetes, as it neatly abstracts away much of the complexity associated with large scale deployments by automating container creation, networking, and load balancing.

How to spread an inference workload across three GPU nodes, to distribute the load and mitigate outages or downtime across the cluster
That's not to say Kubernetes is easy. It's a whole can of worms unto itself, but it's one that many enterprises have already adopted and understand fairly well.
This is one of the reasons we've seen Nvidia, Hugging Face, and others gravitate toward containerized environments with their respective Nvidia Inference Microservices (NIMs) and adorably named HUGS (Hugging Face Generative AI Services), which have been preconfigured for common workloads and deployments. We'll take a look at how to use these a little later in this piece.
Which inference engine should I use? There's no shortage of inference engines for running models. We frequently use Ollama and Llama.cpp in our coverage as they'll work on just about anything, even a Raspberry Pi.
However, if your goal is to serve models at scale, then we tend to see libraries like vLLM, TensorRT LLM, SGLang and even PyTorch used more often.
For the purposes of this tutorial, we'll be looking at deploying models using vLLM, as it supports a wide selection of popular models and offers broad support and compatibility across Nvidia, AMD, and other hardware. However, if you prefer to use something like SGLang or TensorRT LLM, most of the principles described here still apply.
Preparing our Kubernetes environment
Compared to your typical Kubernetes environment, getting pods to talk to your GPUs requires some additional drivers and dependencies. The process for setting this up is going to look different for AMD hardware than it would for Nvidia.
To keep things simple, we'll be deploying K3S in a single-node configuration. The basic steps are largely the same for multi-node environments; you'll just need to ensure the dependencies are satisfied on each GPU worker node, and you may need to tweak the storage configuration.
Note: This is not intended to be an exhaustive guide on container orchestration and assumes some familiarity with Kubernetes concepts. Our goal is to provide a solid foundation for deploying inference workloads in a production-friendly manner. So while we will be walking through the steps to configure a GPU-accelerated Kubernetes node, your environment may look slightly different.
Prerequisites
For this guide you'll need:
- A server or workstation with at least one supported AMD or Nvidia GPU board.
- A fresh install of Ubuntu 24.04 LTS in order to keep dependencies manageable.
Nvidia dependencies
Setting up an Nvidia-accelerated K3S environment is relatively straightforward, but does involve a handful of dependencies.
We'll start by installing the CUDA drivers Fabric Manager and headless server drivers. In our case, the latest release available in the LTS repos is 570, so we'll go with that. We'll also install Nvidia's server utils, as they're helpful for debugging any driver issues that crop up.
sudo apt install -y cuda-drivers-fabricmanager-570 nvidia-headless-570-server nvidia-utils-570-server
Next we'll add Nvidia's container runtime repo and pull down the latest version by running the following commands and then rebooting.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-runtime
sudo reboot
Spinning Up K3S
With most of our dependencies set, we can move on to deploying K3S. For a single node deployment, this can be achieved by running:
curl -sfL https://get.k3s.io | sh -
Once the process completes, we can check whether our node is up and ready by running:
k3s kubectl get nodes
If everything worked correctly, you should see something like this:
NAME           STATUS   ROLES                  AGE     VERSION
kube-nv-prod   Ready    control-plane,master   3h45m   v1.32.3+k3s1
If you're running an AMD Instinct box you can move on to the next section, but if you're using Nvidia there are a few more tweaks we need to make.
By default, K3S will detect that the Nvidia container runtime is installed and configure itself accordingly. However, it's best to double check by running the grep command below; if everything worked properly, you should see nvidia-container-runtime listed in the output.

sudo grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
sudo systemctl restart k3s
Deploying the GPU device driver
With K3S set up, we just need to give Kubernetes permission to allocate GPU resources to our pods. To do this we need to deploy the respective AMD or Nvidia device driver daemon to the cluster.
For Nvidia, this can be achieved by running:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml
AMD users, meanwhile, can install the drivers by running the following two commands:
kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-labeller.yaml
Note: All the commands we're showing here can be run directly on the server, but if you prefer to access your cluster remotely via kubectl, you can find the K3S kubeconfig file under /etc/rancher/k3s/k3s.yaml.
After about a minute, we can check that our GPU has been detected by running:
kubectl describe node
If everything worked correctly, you should now see amd.com/gpu: 1 or nvidia.com/gpu: 1 under the Allocatable header.
With that, your GPU nodes should be ready for deployment and you can move on to Deploying Gemma 3 4B in vLLM (Nvidia or AMD) or Getting started with NIMs (Nvidia).
Deploying Gemma 3 4B in vLLM
To get started we'll be looking at how to deploy Gemma 3 4B using vLLM on an Nvidia L40S or AMD MI210-class part, as well as how you'd tweak your deployment to run on a larger eight-GPU system.
To get started, let's take a look at the manifest file.
vllm-openai-gemma.yaml (Nvidia)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-openai
  labels:
    app: vllm-openai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-openai
  template:
    metadata:
      labels:
        app: vllm-openai
    spec:
      runtimeClassName: nvidia
      hostIPC: true
      containers:
      - name: vllm-openai
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "google/gemma-3-4b-it"
        - "--served-model-name"
        - "Gemma 3 4B"
        - "--disable-log-requests"
        - "-tp"
        - "1"
        - "--max-num-seqs"
        - "8"
        #- "--max_num_batched_tokens"
        #- "16000"
        - "--max_model_len"
        - "16000"
        - "--api-key"
        - "$(API_KEY)"
        ports:
        - containerPort: 8000
        env:
        - name: API_KEY
          valueFrom:
            secretKeyRef:
              name: vllm-api-key
              key: API_KEY
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: HUGGING_FACE_HUB_TOKEN
        volumeMounts:
        - name: huggingface-cache
          mountPath: /root/.cache/huggingface
        - name: shm
          mountPath: /dev/shm
        resources:
          limits:
            nvidia.com/gpu: 1
            cpu: "12"
            memory: 20G
          requests:
            nvidia.com/gpu: 1
            cpu: "8"
            memory: 10G
      volumes:
      - name: huggingface-cache
        hostPath:
          path: /root/.cache/huggingface
          type: Directory
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "10Gi"
As manifests go, this one is pretty standard. It spins up a single pod running the Gemma 3 4B model in vLLM on a single Nvidia GPU — we've included an AMD-specific manifest below, but the same basic concepts apply to both.
With that said, there are a few elements — other than the vLLM arguments, which we'll get to in a bit — worth highlighting.
Replicas: The number of replicas dictates how many instances of the model we want to deploy. If you've got enough GPUs, you could have several instances of vLLM load balanced across a single server. For smaller models, like Gemma 3 4B, this is actually what AMD recommends.
Resources: Defines how much CPU, GPU, and system memory should be allocated to each pod. These need to be adjusted based on the size of your model.
If your LLM can't fit on a single GPU, you'll want to increase the number of GPUs it should request. If, for example, you wanted to run DeepSeek R1 on an H200 or MI300X box with eight GPUs, you'd set both the GPU limit and request to 8.
vLLM configuration
Now, let's move on to the vLLM config itself. As you can see from the manifest, we've passed quite a few arguments to vLLM describing how we want to run the model. Some of these are required, like setting the model location — in this case it's pulling from Hugging Face — as well as the advertised model name, flags to disable log-requests, and the API key which we will pass along via a Kubernetes secret in a bit.
The rest will depend heavily on your specific hardware and use case. So let's dig into each.
tensor-parallel-size: Tensor parallelism enables us to distribute both the model weights and computational load across multiple GPUs.
This number should generally match the number of GPUs under the resources section discussed above. In our case, since we're demonstrating on a single-GPU node, this will be set to 1, which means it's effectively disabled.
max-num-seqs: Sets the upper limit for how many prompts or requests vLLM should process in a single batch. So if you set this to 8, vLLM will process at most eight concurrent requests.
It can be more computationally efficient to set this to a higher number, but that comes with the tradeoff of higher memory consumption and longer wait times before responses start streaming in.
So, if you know your LLM will be mostly handling a handful of requests at any given moment, it can actually be better to set this lower to reduce latency and memory consumption.
max_num_batched_tokens: Behaves similarly to --max-num-seqs, but rather than concurrent requests, it caps the maximum number of tokens vLLM should process at any one time.
The proper setting here will depend on whether you're optimizing for latency or throughput.
A smaller batch size puts more computational load on the system but also reduces memory consumption and latency. A bigger batch size, meanwhile, allows more tokens to be processed at once, allowing it to serve more users, but potentially increasing their wait time.
If you're unsure how to set this, you can actually omit it from the config and vLLM will automatically adjust the batch size based on the available resources.
max_model_len: This defines the maximum number of tokens each sequence should keep track of, and goes hand in hand with the --max_num_seqs we set earlier.

You can think of this a bit like a workbench. The --max_model_len describes how big each workbench is — or rather can be — and the --max_num_seqs describes how many workbenches are available for folks to use at any given moment. For a given space (memory) you can either have fewer bigger workbenches or loads of tiny ones.

vLLM will, by default, set this to the maximum context window supported by the model. On older models this was usually small, but these days context windows routinely exceed 128,000 tokens and some are now pushing a million-plus. Those tokens can take up a lot of memory, so it's often necessary or even desirable to set the --max_model_len to a smaller value.
Together, these parameters largely determine how much GPU memory your deployment will actually need.
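Incidentally, if you want to experiment with these knobs before baking them into a manifest, the same flags can be passed straight to vLLM's command line on a test box. A minimal sketch mirroring the values in the manifest above:

# Quick local test run with the same arguments as the manifest, minus Kubernetes.
vllm serve google/gemma-3-4b-it \
  --served-model-name 'Gemma 3 4B' \
  --tensor-parallel-size 1 \
  --max-num-seqs 8 \
  --max_model_len 16000 \
  --api-key TOP_SECRET_KEY_HERE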
vllm-openai-gemma.yaml (AMD)
As we mentioned earlier, if you're deploying workloads on AMD Instinct accelerators, the manifest file is going to look a little different.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-openai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-openai
  template:
    metadata:
      labels:
        app: vllm-openai
    spec:
      hostIPC: true
      containers:
      - name: vllm-openai
        image: rocm/vllm:instinct_main
        securityContext:
          seccompProfile:
            type: Unconfined
          capabilities:
            add:
            - SYS_PTRACE
        command: ["/bin/sh", "-c"]
        args:
        - >
          vllm serve google/gemma-3-4b-it \
          --served-model-name 'Gemma 3 4B' \
          --disable-log-requests \
          -tp 1 \
          --max-num-seqs 8 \
          --max_model_len 16000 \
          --api-key $(API_KEY)
        ports:
        - containerPort: 8000
        env:
        - name: VLLM_USE_TRITON_FLASH_ATTN
          value: "0"
        - name: API_KEY
          valueFrom:
            secretKeyRef:
              name: vllm-api-key
              key: API_KEY
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: HUGGING_FACE_HUB_TOKEN
        volumeMounts:
        - name: huggingface-cache
          mountPath: /root/.cache/huggingface
        - name: shm
          mountPath: /dev/shm
        resources:
          limits:
            amd.com/gpu: "1"
            cpu: "12"
            memory: "20Gi"
          requests:
            amd.com/gpu: "1"
            cpu: "8"
            memory: "10Gi"
      volumes:
      - name: huggingface-cache
        hostPath:
          path: /home/tobiasmann/.cache/huggingface
          type: Directory
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "10Gi"
Note: This manifest is specific to running vLLM on Instinct accelerators. If you'd hoped to test things out on AMD workstation cards like the W7900, you'll need to build a compatible vLLM container from source by following this guide here, and then push it to your container registry of choice.
Tuning vLLM
While Gemma 3 4B is a pretty small model, requiring a little over 8GB of vRAM to fit the model weights at their native BF16 resolution, trying to use the manifest we just looked at on a 24GB Nvidia L4 will likely result in an out of memory (OOM) error.
If that's the case for you, you'll need to lower --max-num-seqs, --max_model_len, or --max_num_batched_tokens, or some combination of the three.
These parameters can have a big impact on performance and user experience, so let's take a look at two different scenarios to see how you tune them differently.
Scenario 1: Corporate summarization assistant
Let's say you're building a genAI application to help search and summarize large documents for a team of 12. The likelihood that you'll have more than two people running a summarization task simultaneously will be pretty low, so we can set --max_num_seqs to something like two.

However, because these documents are so big, you'll want to set --max_model_len and --max_num_batched_tokens large enough that the document fits within the context window without getting cut off. If your largest document is around 15,000 words, then you might want to set your --max_model_len to 32,000 to give yourself a buffer — remember that tokens not only represent words but punctuation marks too.

Intuitively, you might think that if you have two 32,000-token sequences, you'd want to set your --max_num_batched_tokens to 64,000. However, that's a worst-case scenario, since not every document is going to be 15,000-plus words. For example, if one user summarized a 10,000-word doc and the other a 3,000-word doc, you wouldn't get anywhere close to the cap. Thankfully, vLLM takes a lot of the guesswork out of setting this parameter. If --max_num_batched_tokens is left unset, it'll automatically rightsize itself based on the available memory.
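Pulled together, the relevant vLLM flags for this scenario might look something like the sketch below; the values are illustrative rather than prescriptive.

# Scenario 1: a handful of users summarizing long documents.
# Favor context length over concurrency; leave --max_num_batched_tokens unset
# so vLLM sizes it to the available memory.
vllm serve google/gemma-3-4b-it \
  --max-num-seqs 2 \
  --max_model_len 32000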
Scenario 2: Customer service chatbot

For a customer service chatbot, prioritize a larger number of smaller sequences to maximize the number of concurrent users
Now, let's imagine you're building a customer service chatbot to help users learn about your product or overcome common challenges. Your parameters are going to look quite a bit different from the summarization scenario.
For one, the prompts and responses are going to be much shorter, but you'll likely be serving a larger number of concurrent users. In this case it might make sense to have a large --max_num_seqs, like 16, 32, or more, but a smaller --max_model_len of 1,024 or 2,048.

And again here, we can let vLLM figure out how to set --max_num_batched_tokens for us.
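For comparison, a sketch of the chatbot scenario with the same caveats:

# Scenario 2: many concurrent users, short prompts and replies.
# Favor concurrency over context length; again let vLLM pick --max_num_batched_tokens.
vllm serve google/gemma-3-4b-it \
  --max-num-seqs 32 \
  --max_model_len 2048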
Benchmarking
Depending on your specific use case, it'll likely be prudent to benchmark a few different configurations until you find one that achieves the desired balance of overall throughput, time-to-first-token (the time folks have to wait for the chatbot to start responding), and second-token latency (how quickly the rest of the answer generates).
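vLLM ships with its own benchmark scripts, but for a crude first pass you can simply time a streaming request with curl: time_starttransfer roughly approximates time-to-first-token, while time_total captures how long the full response took. A quick sketch, assuming you can already reach the API (via the ingress configured below, or a kubectl port-forward) and that you substitute your own domain, key, and served model name:

# Crude latency check: on a streaming request, time_starttransfer approximates
# time-to-first-token and time_total covers the whole response.
curl -s -N -o /dev/null \
  -w "TTFT: %{time_starttransfer}s  total: %{time_total}s\n" \
  http://YOUR_DOMAIN_NAME_HERE/v1/chat/completions \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "Gemma 3 4B", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'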
Final preparations: With our vLLM manifest tuned to our liking, we can move on to spinning up vLLM. But first, we'll need to generate two secrets. One is for our Hugging Face token — Gemma 3 is a gated model, so don't forget to request access to the repo page first — and the second you'll use to access the vLLM API server later.
This can be achieved by running the following two commands, swapping out HUGGING_FACE_TOKEN_HERE and TOP_SECRET_KEY_HERE for your token and key.

kubectl create secret generic hf-token --from-literal=HUGGING_FACE_HUB_TOKEN=HUGGING_FACE_TOKEN_HERE
kubectl create secret generic vllm-api-key --from-literal=API_KEY=TOP_SECRET_KEY_HERE
Finally, you'll want to make sure the Hugging Face cache directory exists:
mkdir -p /root/.cache/huggingface
Note: To keep things simple, we're using bind mounts, but there's nothing stopping you from using NFS, a persistent volume store, or some other storage interface to maintain the cached Safetensor files from Hugging Face. You could also just remove that particular bind mount entirely. Just remember that if you do, vLLM will need to re-download them each time a new pod is spun up.
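If you'd rather not have the first pod spend several minutes pulling weights on startup, you can warm that cache ahead of time. A quick sketch, assuming the huggingface_hub CLI is installed on the node and you've already accepted Gemma's license on Hugging Face:

# Pre-populate the bind-mounted cache so new pods don't have to re-download the weights.
pip install -U "huggingface_hub[cli]"
huggingface-cli login --token HUGGING_FACE_TOKEN_HERE
huggingface-cli download google/gemma-3-4b-it   # lands in ~/.cache/huggingface by default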
Spinning up vLLM
With that out of the way, we can spin up Gemma 3 4B in vLLM by running:
kubectl apply -f vllm-openai-gemma.yaml
We can then check the deployment by running a get pods command in kubectl.

kubectl get pods

You should see a vllm-openai-... pod with the status "ContainerCreating." Once it shows as "Running," we can check the vLLM logs to see if there were any problems.
kubectl logs -l app=vllm-openai -f
After a few minutes, you should see:
INFO: Application startup complete.
Ingress and load balancing
With the server up and running, you can now configure your Kubernetes ingress and load balancer as you would with any other container deployment. How this is done is going to depend on your specific Kubernetes environment, which ingress controller and load balancer you're using, and your security policies.
The basic idea here is that Kubernetes exposes a single API address and automatically balances the load across all available GPU nodes.
If you've been following along and just want to make sure the vLLM server is working as intended, you can spin up a quick-and-dirty ingress service by running the commands below. Note that if you are going to replicate this, your front-end application needs to be running on the same subnet behind your network's DMZ, as it's served over standard HTTP. With those precautions out of the way, let's get into it.
We'll start by defining a new ClusterIP service in another YAML file called vllm-openai-svc.yaml containing the following:
apiVersion: v1
kind: Service
metadata:
  name: vllm-openai-svc
  namespace: default
spec:
  type: ClusterIP
  selector:
    app: vllm-openai
  ports:
  - port: 8000
    targetPort: 8000
Then to create the service we'll run:
kubectl apply -f vllm-openai-svc.yaml
Next we'll load our ingress configuration, creating a separate YAML file called vllm-openai-ingressroute.yaml containing the following, replacing YOUR_DOMAIN_NAME_HERE with the domain you plan to use.
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: vllm-openai
  namespace: default
spec:
  entryPoints:
  - web
  routes:
  - match: Host(`YOUR_DOMAIN_NAME_HERE`)
    kind: Rule
    services:
    - name: vllm-openai-svc
      port: 8000
      sticky:
        cookie:
          name: VLLMSESSION
          secure: true
          httpOnly: true
We can then apply it by running:
kubectl apply -f vllm-openai-ingressroute.yaml
You can now update your local DNS server to point the domain you set earlier at your Kubernetes node's IP.

Modifying /etc/hosts

If you don't have access to your local DNS server (usually handled as part of the router config or as a standalone server), you can instead modify your local machine's /etc/hosts file in Linux to point the domain at your Kubernetes node's IP.
sudo nano /etc/hosts
Append the following and then save and exit.
NODE_IP YOUR_DOMAIN_NAME_HERE
Testing it out:
With everything configured, we can now check that the server is up and running by executing:
export VLLM_API_KEY="TOP_SECRET_KEY_HERE"
curl -i http://YOUR_DOMAIN_NAME_HERE/v1/models \
  -H "Authorization: Bearer $VLLM_API_KEY"
If everything worked you should see something along the lines of:
HTTP/1.1 200 OK
Content-Length: 481
Content-Type: application/json
Date: Wed, 16 Apr 2025 20:06:56 GMT
Server: uvicorn
Set-Cookie: VLLMSESSION=48de0c44ee42ce61; Path=/; HttpOnly; Secure

{"object":"list","data":[{"id":"Gemma 3 4B","object":"model","created":1744834016,"owned_by":"vllm","root":"google/gemma-3-4b-it","parent":null,"max_model_len":16000,"permission":[{"id":"modelperm-e17493c83c8047c1b8ce3b082e4c4a61","object":"model_permission","created":1744834016,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
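Listing models only proves the server is answering. To confirm it actually generates tokens, you can fire off a quick chat completion against the same endpoint; the prompt and token cap here are arbitrary.

# Quick inference sanity check against the served model name from the manifest.
curl http://YOUR_DOMAIN_NAME_HERE/v1/chat/completions \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Gemma 3 4B",
        "messages": [{"role": "user", "content": "In one sentence, what is Kubernetes?"}],
        "max_tokens": 64
      }'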
Getting started with NIMs
The sheer number of levers and knobs necessary to flip and turn in order to achieve optimal throughput or latency is one of the reasons why Nvidia, Hugging Face, and others have gravitated toward pre-baked model containers that require little to no configuration to get up and running with.
Nvidia calls these Nvidia Inference Microservices, or NIMs for short, while Hugging Face calls its version of these containers HUGS. These microservices aren't free. You can play with NIMs in a dev environment, but if you want to deploy them in production you'll need an AI Enterprise license that'll set you back $4,500/year per GPU or $1/hour per GPU in the cloud.
However, if you're already paying for said license, they're a no-brainer.
We'll go over the basics of deploying an NIM here, but you'll definitely want to check out Nvidia's docs for specifics on how to tune your configuration to best suit your Kubernetes environment.
Grabbing the dependencies
Deploying NIM on our Kubernetes cluster requires a few additional dependencies, namely Helm and the Nvidia GPU operator. To install Helm, we can simply run:
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
  && chmod 700 get_helm.sh \
  && ./get_helm.sh
We can then add the Helm repo for GPU Operator and install it:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
  && helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.3.0
Adding the NGC Repo and API key
Before you can download any NIMs you'll need to generate an NGC API key by following the instructions here, and add it to your .bashrc or .zshrc file.

export NGC_API_KEY=NGC_API_KEY_HERE
echo "export NGC_API_KEY=NGC_API_KEY_HERE" >> ~/.bashrc

or

echo "export NGC_API_KEY=NGC_API_KEY_HERE" >> ~/.zshrc
Grab the Helm chart and add your customizations
Next we'll download the NIM LLM Helm chart. We're using version 1.7.0 but you can find the latest version here.
helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-1.7.0.tgz --username='$oauthtoken' --password=$NGC_API_KEY
Next we'll add the NGC repo and API key as secrets on our Kubernetes cluster.
kubectl create secret docker-registry ngc-secret --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=$NGC_API_KEY
kubectl create secret generic ngc-api --from-literal=NGC_API_KEY=$NGC_API_KEY
Finally, we'll create a configuration file for our deployment, which in this case is Llama 3 8B Instruct, and save it as custom-values.yaml.
image:
  repository: "nvcr.io/nim/meta/llama3-8b-instruct"
  tag: 1.0.3
model:
  ngcAPISecret: ngc-api
persistence:
  enabled: false
imagePullSecrets:
  - name: ngc-secret
Spinning up the NIM
helm install my-nim nim-llm-1.7.0.tgz -f custom-values.yaml
After a few minutes, your NIM should show as running. We can then test it by forwarding the container port to our machine.
kubectl port-forward service/my-nim-nim-llm 8000:http-openai
Then in a separate shell, we can test it by running:
curl -i http://localhost:8000/v1/models
If everything works, you should see something along the lines of:
HTTP/1.1 200 OK
date: Thu, 17 Apr 2025 21:40:52 GMT
server: uvicorn
content-length: 477
content-type: application/json

{"object":"list","data":[{"id":"meta/llama3-8b-instruct","object":"model","created":1744926052,"owned_by":"system","root":"meta/llama3-8b-instruct","parent":null,"permission":[{"id":"modelperm-47170d15fee9430eb42deda48f0f17b0","object":"model_permission","created":1744926052,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
Of course, you'll want to configure a proper ingress service, as we did earlier for vLLM, but this should at least give you some idea of how NIMs can be deployed in production.
Summing up
Regardless of which route you take to scaling your AI workloads in production, it's important to prioritize flexibility without compromising on resiliency and security.
It's also worth noting that inference, while an essential piece of the AI puzzle, is one of many, and LLMs on their own are only so useful.
As we've previously discussed, building effective AI tools may require multiple technologies and approaches, including fine-tuning, retrieval augmented generation, and, usually, a good bit of data prep.
The Register aims to bring you more on using LLMs and other AI technologies – without the hype – soon. We want to pull back the curtain and show how this stuff really fits together. If you have any burning questions on AI infrastructure, software, or models, we'd love to hear about them in the comments section below. ®
Editor's Note: The Register was provided an RTX 6000 Ada Generation graphics card by Nvidia, an Arc A770 GPU by Intel, and a Radeon Pro W7900 DS by AMD to support stories like this. None of these vendors had any input as to the content of this or other articles.