HPE and Nvidia offer 'turnkey' supercomputer for AI training

If you can afford it – pricing's not out yet

SC23 HPE and Nvidia say they are giving customers the building blocks to produce a mini version of Bristol University's Isambard-AI supercomputer to train generative AI and deep learning projects.

The two companies are linking arms to sell a modular machine that is based on HPE’s Cray EX2500 architecture and Nvidia’s Grace Hopper Superchip, with a software stack comprising tools from both.

The system, which will be demonstrated at the SC23 high performance computing (HPC) conference in Colorado this week, is designed to be simpler for organizations to get up and running with AI training thanks to a preconfigured and pretested stack. Or at least that is the intention.

According to HPE, this system is the first to feature a quad GH200 Superchip node configuration, meaning each node contains four of Nvidia's high-end silicon. Each Superchip combines a 72-core Arm-based Grace CPU and a Hopper GPU and has access to 480GB of LPDDR5x memory and 144GB of HBM3e high-bandwidth memory.
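Taken at face value, those per-Superchip figures add up to a sizeable pool per quad node. A quick back-of-the-envelope sketch, assuming the 480GB and 144GB numbers are per Superchip as described above (illustrative, not vendor-confirmed):

```python
# Rough totals for one quad GH200 node, using the per-Superchip
# figures quoted above (assumed per-chip, not vendor-confirmed).
SUPERCHIPS_PER_NODE = 4
CORES_PER_GRACE_CPU = 72
LPDDR5X_GB_PER_SUPERCHIP = 480
HBM3E_GB_PER_SUPERCHIP = 144

cpu_cores = SUPERCHIPS_PER_NODE * CORES_PER_GRACE_CPU        # 288 Arm cores
lpddr5x_gb = SUPERCHIPS_PER_NODE * LPDDR5X_GB_PER_SUPERCHIP  # 1,920 GB
hbm3e_gb = SUPERCHIPS_PER_NODE * HBM3E_GB_PER_SUPERCHIP      # 576 GB

print(f"Per node: {cpu_cores} CPU cores, "
      f"{lpddr5x_gb} GB LPDDR5x + {hbm3e_gb} GB HBM3e "
      f"({lpddr5x_gb + hbm3e_gb} GB total)")
```

That is roughly 2.5TB of combined memory per node before any of the scaling HPE is pitching.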

The nodes are interconnected using HPE's Slingshot, a network technology that is a kind of superset of Ethernet, with features added to support HPC requirements.

This kind of hardware doesn’t come cheap, but HPE said this particular solution allows customers to start relatively small and scale as required.

“We've got a few customers that have announced Grace Hopper Superchips, but this is unique in that the EX2500 allows you to deploy in units of the quantity of one because all of the cooling, the power and the compute plates are in a single chassis,” Justin Hotard, HPE's EVP for HPC, AI, and Labs, told us.

This means that the system offers “a very simple way for customers to get started and to continue to scale,” he claimed.

The software stack provided as part of this setup includes the HPE Machine Learning Development Environment, a platform for training generative AI models, largely based on technology HPE gained from its purchase of Determined AI in 2021.

Also included is Nvidia’s AI Enterprise suite, a collection of AI tools and frameworks such as TensorFlow, PyTorch, Nvidia's RAPIDS and TensorRT software libraries, and its Triton Inference Server. Customers also get HPE’s Cray Programming Environment, a bunch of tools for developing, porting, and debugging code.

Hotard said AI training is probably one of the most computationally intensive workloads you can come across, and one that calls for a different compute architecture.

“We all know cloud architectures were optimized around maximizing the utilization on a single server. So the way we think about those workloads is they're ideal to be broken up into smaller and smaller pieces,” he said.

“But AI workloads, particularly training and large scale tuning, are fundamentally different. These workloads require the entire datacenter in some cases to operate as a single computer. It's one workload running over hundreds or thousands of nodes and the compute, the interconnect and the storage need to operate at a scale that's much more consistent with what we see in supercomputers,” he claimed.

Naturally, this new system is meant to deliver on that, for organizations that can afford it, but HPE declined to detail how much it costs. Hotard said prices are to be published in the near future.

Nvidia’s Scientific Program Manager Jack Wells claimed benchmarks showed a single GH200-based node is 100 times faster than a dual Xeon server at processing a large language model (LLM) inference workload using Llama 2.

“Generative AI is restructuring scientific computing, and it's going to really drive a tremendous amount of demand,” he claimed, adding that HPE and Nvidia already have several customers for this product.

These include the Swiss National Supercomputing Centre (CSCS), Cyfronet in Poland, Los Alamos National Laboratory, and the Isambard-AI system at Bristol University, the latter of which is slated to deploy 5,448 of the Nvidia GH200 Superchips.
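At 5,448 Superchips, the earlier per-chip figures scale up quickly. A hypothetical tally for Isambard-AI, assuming the quad-Superchip node packing and per-chip memory sizes quoted above (the real system's node layout may differ):

```python
# Illustrative aggregate figures for Isambard-AI's 5,448 GH200
# Superchips, assuming four Superchips per node and the per-chip
# memory sizes quoted earlier; not vendor-confirmed.
TOTAL_SUPERCHIPS = 5448
SUPERCHIPS_PER_NODE = 4
LPDDR5X_GB = 480
HBM3E_GB = 144

nodes = TOTAL_SUPERCHIPS // SUPERCHIPS_PER_NODE        # 1,362 nodes
total_hbm_tb = TOTAL_SUPERCHIPS * HBM3E_GB / 1000      # ~785 TB HBM3e
total_lpddr_tb = TOTAL_SUPERCHIPS * LPDDR5X_GB / 1000  # ~2,615 TB LPDDR5x

print(f"{nodes} nodes, ~{total_hbm_tb:.0f} TB HBM3e, "
      f"~{total_lpddr_tb:.0f} TB LPDDR5x")
```
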

HPE said the service will be available from December in more than 30 countries, and while it targets customers in AI innovation centers within public sector and research institutions, the company also expects to see interest from large enterprises. ®
