A friendly guide to containerization for AI work
Save the headaches, ship your dependencies
Hands on One of the biggest headaches associated with AI workloads is wrangling all of the drivers, runtimes, libraries, and other dependencies they need to run.
This is especially true for hardware-accelerated tasks where, if you've got the wrong version of CUDA, ROCm, or PyTorch, there's a good chance you'll be left scratching your head while staring at an error.
If that weren't bad enough, some AI projects and apps may have conflicting dependencies, while different operating systems may not support the packages you need. However, by containerizing these environments we can avoid much of this mess by building images that are configured specifically for a task and - perhaps more importantly - can be deployed in a consistent and repeatable manner every time.
And because the containers are largely isolated from one another, you can usually have apps running with conflicting software stacks. For example, you can have two containers, one with CUDA 11 and the other with 12, running at the same time.
This is one of the reasons that chipmakers often make containerized versions of their accelerated-computing software libraries available to users since it offers a consistent starting point for development.
Prerequisites
In this tutorial we'll be looking at a variety of ways that containerization can be used to assist in the development and/or deployment of AI workloads whether they be CPU or GPU accelerated.
This guide assumes that you:
- Are running on Ubuntu 24.04 LTS (Other distros should work, but your mileage may vary).
- Have the latest release of Docker Engine installed and a basic understanding of the container runtime.
- Are running Nvidia's proprietary drivers, if applicable.
While there are a number of container environments and runtimes, we'll be looking specifically at Docker for its simplicity and broad compatibility. Having said that, many of the concepts shown here will apply to other containerization runtimes such as Podman, although the execution may be a little different.
Exposing Intel and AMD GPUs to Docker
Unlike with virtual machines, where a GPU is typically dedicated to a single guest, you can pass your GPU through to as many containers as you like, and so long as you don't exceed the available vRAM you shouldn't have an issue.
For those with Intel or AMD GPUs, the process couldn't be simpler: it just involves passing the right flags when spinning up your container.
For example, let's say you want to make your Intel GPU available to an Ubuntu 22.04 container. You'd append --device /dev/dri to the docker run command. Assuming you're on a bare-metal system with an Intel GPU, you'd run something like:
docker run -it --rm --device /dev/dri ubuntu:22.04
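To sanity-check that the GPU actually made it into the container, you can list the direct rendering device nodes from inside it; you should see entries along the lines of card0 and renderD128:
ls -l /dev/dri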
Meanwhile, for AMD GPUs you'd pass both --device /dev/kfd and --device /dev/dri:
docker run -it --rm --device /dev/kfd --device /dev/dri ubuntu:22.04
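Depending on your distro and how device permissions are set up, you may also need to make sure the user inside the container belongs to the right groups. AMD's ROCm documentation generally suggests flags along these lines as a starting point, though treat the extras as optional until you hit a permissions error:
docker run -it --rm --device /dev/kfd --device /dev/dri --group-add video --security-opt seccomp=unconfined ubuntu:22.04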
Note: Depending on your system, you'll probably need to run these commands with elevated privileges using sudo docker run or, in some cases, doas docker run.
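Alternatively, if you'd rather not prefix every command, adding your user to the docker group on the host works on most systems - just be aware that doing so effectively grants that account root-level access to the machine:
sudo usermod -aG docker $USER
newgrp docker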
Exposing Nvidia GPUs to Docker
If you happen to be running one of Team Green's cards, you'll need to install the Nvidia Container Toolkit before you can expose it to your Docker containers.
To get started, we'll add the software repository for the toolkit to our sources list and refresh Apt. (You can see Nvidia's docs for instructions on installing on RHEL and SUSE-based distros here.)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
Now we can install the container runtime and configure Docker to use it.
sudo apt install -y nvidia-container-toolkit
With the container toolkit installed, we just need to tell Docker to use the Nvidia runtime by editing the /etc/docker/daemon.json file. To do this, we can simply execute the following:
sudo nvidia-ctk runtime configure --runtime=docker
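If you're curious what that actually changed, you can take a peek at the file afterward. The tool should have added a runtime entry along these lines, though the exact contents may vary between toolkit versions:
cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}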
The last step is to restart the Docker daemon and test that everything is working by launching a container with the --gpus=all flag.
sudo systemctl restart docker
docker run -it --rm --gpus=all ubuntu:22.04
Note: If you have multiple GPUs, you can specify which ones to expose by using the --gpus=1 or --gpus '"device=1,3,4"' flags instead.
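For instance, on a machine with several cards, either of these forms should work in place of --gpus=all:
docker run -it --rm --gpus=1 ubuntu:22.04
docker run -it --rm --gpus '"device=1,3,4"' ubuntu:22.04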
Inside the container, you can then run nvidia-smi and you should see something similar to this appear on your screen.
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX 6000 Ada Gene... Off | 00000000:06:10.0 Off | Off |
| 30% 29C P8 9W / 300W | 8045MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 941 C python3 7506MiB |
| 0 N/A N/A 40598 C /usr/local/bin/python3 528MiB |
+-----------------------------------------------------------------------------------------+
Using Docker containers as dev environments
One of the most useful applications of Docker containers when working with AI software libraries and models is as a development environment. This is because you can spin up as many containers as you need and tear them down when you're done without worrying about borking your system.
Now, you can just spin up a base image of your distro of choice, expose your GPU to it, and start installing CUDA, ROCm, PyTorch, or TensorFlow. For example, to create a basic GPU-accelerated Ubuntu container, you'd run the following (remember to change the --gpus or --device flag appropriately) to create and then access the container.
docker run -itd --gpus=all -p 8081:80 -v ~/denv:/home/denv --name GPUtainer ubuntu:22.04
docker exec -it GPUtainer /bin/bash
This will create a new Ubuntu 22.04 container named GPUtainer that:
- Has access to your Nvidia GPU
- Exposes port 80 on the container as port 8081 on your host
- Mounts /home/denv in the container as a denv folder in your host's home directory for easy file transfer
- Continues running after you exit
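Because it was started detached, exiting the shell won't kill the container; when you no longer need it, you can stop and remove it like so (anything you stashed in /home/denv will survive in the denv folder on your host):
docker stop GPUtainer
docker rm GPUtainer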
Using prebuilt images
While building up a container from scratch with CUDA, ROCm, or OpenVINO can be useful at times, it's also rather tedious and time-consuming, especially when there are prebuilt images out there that'll do most of the work for you.
For example, if we want to get a basic CUDA 12.5 environment up and running, we can use an nvidia/cuda image as a starting point. To test it, run:
docker run -it --gpus=all -p 8081:80 -v ~/denv:/home/denv --name CUDAtainer nvidia/cuda:12.5.0-devel-ubuntu22.04
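Once inside, a quick way to confirm the CUDA toolkit is actually there is to ask the compiler for its version, which for this image should report release 12.5:
nvcc --version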
Or, if you've got an AMD card, we can use one of the ROCm images, like this rocm/dev-ubuntu-22.04 one.
docker run -it --device /dev/kfd --device /dev/dri -p 8081 -v ~/denv:/home/denv --name ROCmtainer rocm/dev-ubuntu-22.04
Meanwhile, owners of Intel GPUs should be able to create a similar environment using this OpenVINO image.
docker run -it --device /dev/dri:/dev/dri -p 8081 -v ~/denv:/home/denv --name Vinotainer openvino/ubuntu22_runtime:latest
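If you want to double-check that OpenVINO can see your GPU from inside that container, the runtime image ships with the Python API, so a one-liner along these lines should print the available devices - expect CPU plus GPU if the passthrough worked. Treat the exact import as an assumption, as the module layout has shifted between OpenVINO releases:
python3 -c "from openvino import Core; print(Core().available_devices)"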
Converting your containers into images
By design, Docker containers are largely ephemeral in nature, which means that changes to them won't be preserved if, for example, you were to delete the container or update the image. However, we can save any changes by committing them to a new image.
To commit the changes made to the CUDA dev environment we created in the last step, we'd run the following to create a new image called "cudaimage":
docker commit CUDAtainer cudaimage
We could then spin up a new container based on it by running:
docker run -itd --gpus=all -p 8082:80 -v ~/denv:/home/denv --name CUDAtainer2 cudaimage
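Committed images only exist in your local image store, so if you want to move one to another machine without standing up a registry, one low-tech option is to export it to a tarball and load it on the other end - just be warned that anything with CUDA baked in tends to weigh several gigabytes:
docker save cudaimage | gzip > cudaimage.tar.gz
docker load < cudaimage.tar.gz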
Building custom images
Converting existing containers into reproducible images can be helpful for creating checkpoints and testing out changes. But if you plan to share your images, it's generally best practice to show your work in the form of a Dockerfile.
This file is essentially just a list of instructions that typically tells Docker how to turn an existing image into a custom one. As with much of this tutorial, if you're at all familiar with Docker or the docker build command, most of this should be self-explanatory.
For those new to generating Docker images, we'll go through a simple example using this AI weather app we kludged together in Python. It uses Microsoft's Phi-3 Mini instruct LLM to generate a human-readable report, in the tone of a TV weather personality, from stats gathered from Open Weather Map every 15 minutes.
import json
import time
from typing import Dict, Any

import requests
import torch
from transformers import pipeline, BitsAndBytesConfig

# Constants
ZIP_CODE = YOUR_ZIP_CODE
API_KEY = "YOUR_OPEN_WEATHER_MAP_API_KEY"  # Replace with your OpenWeatherMap API key
WEATHER_URL = f"http://api.openweathermap.org/data/2.5/weather?zip={ZIP_CODE}&appid={API_KEY}"
UPDATE_INTERVAL = 900  # seconds

# Initialize the text generation pipeline
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
pipe = pipeline("text-generation", "microsoft/Phi-3-mini-4k-instruct", device_map="auto", model_kwargs={"quantization_config": quantization_config})

def kelvin_to_fahrenheit(kelvin: float) -> float:
    """Convert Kelvin to Fahrenheit."""
    return (kelvin - 273.15) * 9/5 + 32

def get_weather_data() -> Dict[str, Any]:
    """Fetch weather data from OpenWeatherMap API."""
    response = requests.get(WEATHER_URL)
    response.raise_for_status()
    return response.json()

def format_weather_report(weather_data: Dict[str, Any]) -> str:
    """Format weather data into a report string."""
    main_weather = weather_data['main']
    location = weather_data['name']
    conditions = weather_data['weather'][0]['description']
    temperature = kelvin_to_fahrenheit(main_weather['temp'])
    humidity = main_weather['humidity']
    wind_speed = weather_data['wind']['speed']

    return (f"The time is: {time.strftime('%H:%M')}, "
            f"location: {location}, "
            f"Conditions: {conditions}, "
            f"Temperature: {temperature:.2f}°F, "
            f"Humidity: {humidity}%, "
            f"Wind Speed: {wind_speed} m/s")

def generate_weather_report(weather_report: str) -> str:
    """Generate a weather report using the text generation pipeline."""
    chat = [
        {"role": "assistant", "content": "You are a friendly weather reporter that takes weather data and turns it into short reports. Keep these short, to the point, and in the tone of a TV weather man or woman. Be sure to inject some humor into each report too. Only use units that are standard in the United States. Always begin every report with 'in (location) the time is'"},
        {"role": "user", "content": f"Today's weather data is {weather_report}"}
    ]
    response = pipe(chat, max_new_tokens=512)
    return response[0]['generated_text'][-1]['content']

def main():
    """Main function to run the weather reporting loop."""
    try:
        while True:
            try:
                weather_data = get_weather_data()
                weather_report = format_weather_report(weather_data)
                generated_report = generate_weather_report(weather_report)
                print(generated_report)
            except requests.RequestException as e:
                print(f"Error fetching weather data: {e}")
            except Exception as e:
                print(f"An unexpected error occurred: {e}")
            time.sleep(UPDATE_INTERVAL)
    except KeyboardInterrupt:
        print("\nWeather reporting stopped.")

if __name__ == "__main__":
    main()
Note: If you are following along, be sure to set your zip code and Open Weather Map API key appropriately.
If you're curious, the app works by passing the weather data and instructions to the LLM via the Transformers pipeline module, which you can learn more about here.
On its own, the app is already fairly portable with minimal dependencies. However, it still relies on the CUDA runtime being installed correctly, something we can make easier to manage by containerizing the app.
To start, in a new directory create an empty Dockerfile alongside the weather_app.py Python script above. Inside the Dockerfile we'll define which base image we want to start with, as well as the working directory we'd like to use.
FROM nvidia/cuda:12.5.0-devel-ubuntu22.04
WORKDIR /ai_weather
Below this, we'll tell the Dockerfile to copy the weather_app.py script to the working directory.
ADD weather_app.py /ai_weather/
From here, we simply need to tell it what commands it should RUN to set up the container and install any dependencies. In this case, we just need a few Python modules, as well as the latest release of PyTorch for our GPU.
RUN apt update
RUN apt upgrade -y
RUN apt install python3 python3-pip -y
RUN pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
RUN pip3 install requests accelerate transformers
RUN pip3 install "bitsandbytes>=0.39.0" -q
Finally, we'll set the CMD to the command or executable we want the container to run when it's first started. With that, our Dockerfile is complete and should look like this:
FROM nvidia/cuda:12.5.0-devel-ubuntu22.04
WORKDIR /ai_weather
ADD weather_app.py /ai_weather/
RUN apt update
RUN apt upgrade -y
RUN apt install python3 python3-pip -y
RUN pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
RUN pip3 install requests accelerate transformers
RUN pip3 install "bitsandbytes>=0.39.0" -q
CMD ["/bin/bash", "-c", "python3 weather_app.py"]
Now all we have to do is build the Dockerfile into a new image by running the following, and then sit back and wait.
docker build -t aiweather .
After a few minutes, the image should be complete and we can use it to spin up our container in interactive mode. Note: Remove the --rm bit if you don't want the container to destroy itself when stopped.
docker run -it --rm --gpus=all aiweather
After a few seconds the container will launch, download Phi-3 from Hugging Face, quantize it to 4-bit precision, and present our first weather report.
"In Aurora, the time is 2:28 PM, and it's a hot one! We've got scattered clouds playing hide and seek, but don't let that fool you. It's a scorcher at 91.69°F, and the air's as dry as a bone with just 20% humidity. The wind's blowing at a brisk 6.26 m/s, so you might want to hold onto your hats! Stay cool, Aurora!"
Naturally, this is an intentionally simple example, but hopefully it illustrates how containerization can make AI apps easier to build and deploy. We recommend taking a look at Docker's documentation here if you need anything more intricate.
What about NIMs?
As with any other app, containerizing your AI projects has a number of advantages beyond making them more reproducible and easier to deploy at scale: it also allows models to be shipped alongside configurations that have been optimized for specific use cases or hardware.
This is the idea behind Nvidia Inference Microservices — NIMs for short — which we looked at back at GTC this spring. These NIMs are really just containers built by Nvidia with specific versions of software such as CUDA, Triton Inference Server, or TensorRT LLM that have been tuned to achieve the best possible performance on their hardware.
And since they're built by Nvidia, every time the GPU giant releases an update to one of its services that unlocks new features or higher performance on new or existing hardware, users will be able to take advantage of these improvements simply by pulling down a new NIM image. Or that's the idea anyway.
Over the next couple of weeks, Nvidia is expected to make its NIMs available for free via its developer program for research and testing purposes. But before you get too excited, if you want to deploy them in production you're still going to need an AI Enterprise license, which will set you back $4,500/year per GPU or $1/hour per GPU in the cloud.
We plan to take a closer look at Nvidia's NIMs in the near future. But, if an AI enterprise license isn't in your budget, there's nothing stopping you from building your own optimized images, as we've shown in this tutorial. ®
Editor's Note: Nvidia provided The Register with an RTX 6000 Ada Generation graphics card to support this story and others like it. Nvidia had no input as to the contents of this article.