Nvidia software exec Kari Briski on NIM, CUDA, and dogfooding AI

A RAGs to riches story

Interview Nvidia's GPU Technology Conference concluded last week, bringing word of the super-corp's Blackwell chips and the much-ballyhooed wonders of AI, with all the dearly purchased GPU hardware that implies.

Such is the buzz around the chip biz that its stock price is flirting with record highs, based on the notion that many creative endeavors can be made faster if not better with the automation enabled by machine learning models.

That's still being tested in the market.

George Santayana once wrote: "Those who cannot remember the past are condemned to repeat it." It is a phrase often repeated. Yet remembrance of things past hasn't really set AI models apart. They can remember the past but they're still condemned to repeat it on demand, at times incorrectly.

Even so, many swear by almighty AI, particularly those selling AI hardware or cloud services. Nvidia, among others, is betting big on it. So The Register made a brief visit to the GPU conference to see what all the fuss was about. It was certainly not about the lemon bars served in the exhibit hall on Thursday, many of which ended their initial public offering unfinished in show floor bins.

Far more engaging was a conversation The Register had with Kari Briski, vice president of product management for AI and HPC software development kits at Nvidia. She heads up software product management for the company's foundation models, libraries, SDKs, and now microservices that deal with training and inference, like the newly announced NIM microservices and the better established NeMo deployment framework.

The Register: How are companies going to consume these microservices – in the cloud, on premises?

Briski: That's actually the beauty of why we built the NIMs. It's kind of funny to say "the NIMs." But we started this journey a long time ago. We've been working in inference since I started – I think it was TensorRT 1.0 when I started 2016.

Over the years we have been growing our inference stack, learning more about every different kind of workload, starting with computer vision and deep recommender systems and speech, automatic speech recognition and speech synthesis and now large language models. It's been a really developer-focused stack. And now that enterprises [have seen] OpenAI and ChatGPT, they understand the need to have these large language models running next to their enterprise data or in their enterprise applications.

The average cloud service provider, for their managed services, they've had hundreds of engineers working on inference, optimization techniques. Enterprises can't do that. They need to get the time-to-value right away. That's why we encapsulated everything that we've learned over the years with TensorRT, large language models, our Triton Inference Server, standard API, and health checks. [The idea is to be] able to encapsulate all that so you can get from zero to a large language model endpoint in under five minutes.

[With regard to on-prem versus cloud datacenter], a lot of our customers are hybrid cloud. They have preferred compute. So instead of sending the data away to a managed service, they can run the microservice close to their data and they can run it wherever they want.

The Register: What does Nvidia's software stack for AI look like in terms of programming languages? Is it still largely CUDA, Python, C, and C++? Are you looking elsewhere for greater speed and efficiency?

Briski: We're always exploring wherever developers are using. That has always been our key. So ever since I started at Nvidia, I've worked on accelerated math libraries. First, you had to program in CUDA to get parallelism. And then we had C APIs. And we had a Python API. So it's about taking the platform wherever the developers are. Right now, developers just want to hit a really simple API endpoint, like with a curl command or a Python command or something similar. So it has to be super simple, because that's kind of where we're meeting the developers today.

The Register: CUDA obviously plays a huge role in making GPU computation effective. What's Nvidia doing to advance CUDA?

Briski: CUDA is the foundation for all our GPUs. It's a CUDA-enabled, CUDA-programmable GPU. A few years ago, we called it CUDA-X, because you had these domain-specific languages. So if you have a medical imaging [application], you have cuCIM. If you have automatic speech recognition, you have a CUDA accelerated beam search decoder at the end of it. And so there's all these specific things for every different type of workload that have been accelerated by CUDA. We've built up all these specialized libraries over the years like cuDF and cuML, and cu-this-and-that. All these CUDA libraries are the foundation of what we built over the years and now we're kind of building on top of that.

The Register: How does Nvidia look at cost considerations in terms of the way it designs its software and hardware? With something like Nvidia AI Enterprise, it's $4,500 per GPU every year, which is considerable.

Briski: First, for smaller companies, we always have the Inception program. We are always working with customers – a free 90-day trial, is it really valuable to you? Is it really worth it? Then, for reducing your costs when you buy into that, we are always optimizing our software. So if you were buying the $4,500 per GPU per year per license, and you're running on an A100, and you run on an H100 tomorrow, it's the same price – your cost has gone down [relative to your throughput]. So we're always building those optimizations and total cost of ownership and performance back into the software.

When we're thinking about both training and inference, the training does take a little bit more, but we have these auto configurators to be able to say, "How much data do you have? How much compute do you need? How long do you want it to take?" So you can have a smaller footprint of compute, but it just might take longer to train your model … Would you like to train it in a week? Or would you like to train it in a day? And so you can make those trade offs.

The Register: In terms of current problems, is there anything particular you'd like to solve or is there a technical challenge you'd like to overcome?

Briski: Right now, it's event-driven RAGs [which is a way of augmenting AI models with data fetched from an external source]. A lot of enterprises are just thinking of the classical prompt to generate an answer. But really, what we want to do is [chain] all these retrieval-augmented generative systems all together. Because if you think about you, and a task that you might want to get done: "Oh, I gotta go talk to the database team. And that database team's got to go talk to the Tableau team. They gotta make me a dashboard," and all these things have to happen before you can actually complete the task. And so it's kind of that event-driven RAG. I wouldn't say RAGs talking to RAGs, but it's essentially that – agents going off and performing a lot of work and coming back. And we're on the cusp of that. So I think that's kind of something I'm really excited about seeing in 2024.

The Register: Is Nvidia dogfooding its own AI? Have you found AI useful internally?

Briski: Actually, we went off and last year, since 2023 was the year of exploration, there were 150 teams within Nvidia that I found – there could have been more – and we were trying to say, how are you using our tools, what kind of use cases and we started to combine all of the learnings, kind of from like a thousand flowers blooming, and we kind of combined all their learnings into best practices into one repo. That's actually what we released as what we call Generative AI Examples on GitHub, because we just wanted to have all the best practices in one place.

That's kind of what we did structurally. But as an explicit example, I think we wrote this really great paper called ChipNeMo, and it's actually all about our EDA, VLSI design team, and how they took the foundation model and they trained it on our proprietary data. We have our own coding languages for VLSI. So they were coding copilots [open source code generation models] to be able to generate our proprietary language and to help the productivity of new engineers coming on who don't quite know our VLSI design chip writing code.

And that has resonated with every customer. So if you talk to SAP, they have ABAP (Advanced Business Application Programming,) which is like a proprietary SQL to their database. And I talked to three other customers that had different proprietary languages – even SQL has like hundreds of dialects. So being able to do code generation is not a use case that's immediately solvable by RAG. Yes, RAG helps retrieve documentation and some code snippets, but unless it's trained to generate the tokens in that language, it can't just make up code.

The Register: When you look at large language models and the way they're being chained together with applications, are you thinking about the latency that may introduce and how to deal with that? Are there times when simply hardcoding a decision tree seems like it would make more sense?

Briski: You're right, when you ask a particular question, or prompt, there could be, just even for one question, there could be five or seven models already kicked off so you can get prompt rewriting and guardrails and retriever and re-ranking and then the generator. That's why the NIM is so important, because we have optimized for latency.

That's also why we offer different versions of the foundation models because you might have an SLM, a small language model that's kind of better for a particular set of tasks, and then you want the larger model for more accuracy at the end. But then chaining that all up to fit in your latency window is a problem that we've been solving over the years for many hyperscale or managed services. They have these latency windows and a lot of times when you ask a question or do a search, they're actually going off and farming out the question multiple times. So they've got a lot of race conditions of "what is my latency window for each little part of the total response?" So yes, we're always looking at that.

To your point about hardcoding, I just talked to a customer about that today. We are way beyond hardcoding … You could use a dialogue manager and have if-then-else. [But] managing the thousands of rules is really, really impossible. And that's why we like things like guardrails, because guardrails represent a sort of replacement to a classical dialogue manager. Instead of saying, "Don't talk about baseball, don't talk about softball, don't talk about football," and listing them out you can just say, "Don't talk about sports." And then the LLM knows what a sport is. The time savings, and being able to manage that code later, is so much better. ®

More about


Send us news

Other stories you might like