Nvidia: Why write code when you can string together a couple chat bots?

GPU giant says NIM will eliminate dependency headaches for the low low cost of $4,500/year per GPU

Never mind using large language models (LLMs) to help write code. Nvidia CEO Jensen Huang believes that in the future, enterprise software will just be a collection of chat bots strung together to complete the task.

"It is unlikely that you'll write it from scratch or write a whole bunch of Python code or anything like that," he said on stage during his GTC keynote Monday. "It is very likely that you assemble a team of AI."

This AI team, Huang explained, might include a model designed to break down a request and delegate it to various other models. Some of those models might be trained to understand business services like SAP or ServiceNow, while others might perform numerical analysis on data stored in a vector database. Their output could then be combined and presented to the end user by yet another model.
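The "team of AI" pattern Huang describes can be sketched as a simple dispatcher: a router hands each request to a specialist, and a final step assembles the results. Everything below (the handler names, the keyword-based routing) is a hypothetical illustration, not any real Nvidia, SAP, or ServiceNow API — and a real router would itself be an LLM rather than a keyword lookup:

```python
# Hypothetical sketch of an "AI team": a router delegates a request to
# specialist handlers, then a final step combines their answers.

def erp_specialist(query: str) -> str:
    # Stand-in for a model fine-tuned on a business system like SAP
    return f"ERP data relevant to: {query}"

def numeric_specialist(query: str) -> str:
    # Stand-in for a model doing numerical analysis over a vector DB
    return f"Numerical analysis for: {query}"

ROUTES = {
    "invoice": erp_specialist,
    "forecast": numeric_specialist,
}

def route(query: str) -> str:
    """Pick a specialist by keyword; a real router would be an LLM."""
    for keyword, handler in ROUTES.items():
        if keyword in query.lower():
            return handler(query)
    return f"No specialist found for: {query}"

def assemble_report(queries: list[str]) -> str:
    # A final "presenter" model would summarize; here we just join
    return "\n".join(route(q) for q in queries)
```

Calling `assemble_report(["Q3 forecast", "open invoices"])` would route each line to a different specialist and stitch the answers into one report — the daily-briefing workflow Huang sketches below.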

"We can get a report every single day or you know, top of the hour that has something to do with a build plan, or some forecast, or some customer alert, or some bugs database or whatever it happens to be," he explained.

To chain all these models together, Nvidia is taking a page out of Docker's book and has created a container runtime for AI.

Dubbed Nvidia Inference Microservices, or NIMs for short, these are essentially container images packaging a model, whether open source or proprietary, along with all the dependencies needed to run it. These containerized models can then be deployed across any number of runtimes, including Nvidia-accelerated Kubernetes nodes.

"You can deploy it on our infrastructure called DGX Cloud, or you can deploy it on prem, or you can deploy it anywhere you like. Once you develop it, it's yours to take anywhere," Huang said.

Of course, you'll need a subscription to Nvidia's AI Enterprise suite first, which isn't exactly cheap at $4,500 per GPU per year, or $1 per GPU per hour in the cloud. This pricing would seem to incentivize denser, higher-performance systems, since it costs the same whether you're running on L40S or B100 accelerators.

If the idea of containerizing GPU-accelerated workloads sounds familiar, that's because this isn't exactly new territory for Nvidia. CUDA acceleration has been supported on a wide variety of container runtimes, including Docker, Podman, containerd, and CRI-O, for years, and it doesn't look like Nvidia's Container Runtime is going anywhere.

The value proposition behind NIM appears to be that Nvidia will handle the packaging and optimization of these models so that they ship with the right versions of CUDA, Triton Inference Server, or TensorRT-LLM needed to get the best performance out of them.

The argument is that if Nvidia releases an update that dramatically boosts inference performance for certain model types, taking advantage of it would just require pulling down the latest NIM image.

In addition to hardware-specific model optimizations, Nvidia is also working on standardizing communications between containers, so that they can talk to one another via API calls.

As we understand it, the API calls used by the various AI models on the market today aren't always consistent, which makes some models easy to string together while others require additional work.
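The payoff of a consistent API is that swapping one model for another becomes a one-string change rather than per-model glue code. A toy illustration, assuming a chat-completions-style request schema of the kind widely used for LLM serving — the model names and the localhost endpoint in the comment are assumptions, not confirmed Nvidia details:

```python
# Sketch of why a consistent API matters: if every containerized model
# accepts the same request body, chaining or swapping models needs no
# per-model adapter code.
import json

def chat_request(model: str, prompt: str) -> dict:
    """Build one request body usable for any model behind the API."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

# The same payload shape targets two different models
req_a = chat_request("meta/llama2-70b", "Summarize today's build plan")
req_b = chat_request("mistralai/mixtral-8x7b", "Summarize today's build plan")

# In practice each body would be POSTed to that model's endpoint, e.g.:
# requests.post("http://localhost:8000/v1/chat/completions", json=req_a)
payload = json.dumps(req_a)  # serialized for the wire
```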

Lending institutional knowledge to general purpose models

Anyone who has used an AI chatbot will know that while they're usually pretty good with general knowledge questions, they aren't always the most reliable with obscure or technical requests.

Huang highlighted this fact during his keynote. Asked about an internal program used within Nvidia, Meta's Llama 2 70B large language model unsurprisingly provided the definition of an unrelated term.

Instead of trying to get enterprises to train their own models — something that would sell a lot of GPUs but would limit the addressable market considerably — Nvidia has developed tools to fine-tune its NIMs with customer data and processes.

"We have a service called NeMo Microservices that helps you curate the data, prepare the data so that you can… onboard this AI. You fine tune it and then you guardrail it; you can then evaluate… its performance against other examples," Huang explained.

He also talked up Nvidia's NeMo Retriever service, which is built around retrieval augmented generation (RAG), a technique for surfacing information the model hasn't been specifically trained on.

The idea here is that documents, processes, and other data can be loaded into a vector database that is connected to the model. Based on a query, the model can then search that database, retrieve, and summarize the relevant information.
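That retrieval step can be boiled down to a few lines. The sketch below uses hand-rolled bag-of-words vectors and an in-memory list purely to show the flow; a real deployment would use a learned embedding model and an actual vector database, and the sample documents are invented:

```python
# Minimal sketch of the retrieval step in RAG: documents are embedded
# into vectors, stored, and the closest match to a query is retrieved
# for the model to summarize.
import math
from collections import Counter

DOCS = [
    "The Q3 build plan targets 10,000 units per week.",
    "Customer alerts are triaged within four hours.",
    "The bugs database is reviewed every Monday.",
]

def embed(text: str) -> Counter:
    # Toy embedding: word counts (a real system uses a neural encoder)
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

INDEX = [(embed(d), d) for d in DOCS]  # the "vector database"

def retrieve(query: str) -> str:
    """Return the stored document most similar to the query."""
    qv = embed(query)
    return max(INDEX, key=lambda pair: cosine(qv, pair[0]))[1]
```

A query like `retrieve("when is the bugs database reviewed")` would surface the Monday-review document, which the generating model would then fold into its answer — exactly the build-plan-and-bugs-database reporting Huang described earlier.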

NIM models and NeMo Retriever for RAG integration are available now, while NeMo Microservices is in early access. ®
