LLNL looks to make HPC a little cloudier with Oxide's rackscale compute platform
System to serve as a proof of concept for applying API-driven automation to scientific computing
SC24 Oxide Computer's 2,500-pound (1.1-metric-ton) rackscale blade servers are getting a new home at the Department of Energy's Lawrence Livermore National Laboratory (LLNL).
The planned deployment, announced Monday at the annual Supercomputing conference, SC24, in Atlanta, comes as the lab looks to embrace a cloudier approach to on-prem high-performance computing.
As compute stacks go, Oxide's are more than a little unusual. Rather than the typical 19- or 21-inch cabinets common in enterprise and hyperscale datacenters, Oxide's rack is really more of a chassis that houses 32 compute nodes. If you notice a general lack of cabling, that's because each of those nodes is interconnected by an integrated backplane that provides not just power but 12.8Tbps of switching capacity to boot.
A full rack comes equipped with some 2,048 AMD Epyc cores, 32TB of RAM, and no shortage of NVMe storage, and has a rated power draw of up to 15 kilowatts. But pull out one of the hyperscale-inspired systems and you'll realize much of the underlying hardware is custom as well. For one, there's no ASpeed BMC on board. Instead, Oxide developed its own, which runs a Rust-based operating system called Hubris.
As novel as Oxide's approach to hardware might be, it won't be taking over for any of the HPC clusters, like El Capitan, that call LLNL home.
"The Livermore Computing Center is 40-ish clusters that we provide to end users and we have around 4,000 users for the center overall. But it's not just compute clusters; there's a whole bunch of other hardware there that keeps the compute clusters going," Todd Gamblin, distinguished member of technical staff at LLNL, told The Register in an interview ahead of SC24.
"We have a lot of mission needs emerging that we think are going to require us to have new types of services at the center, specifically more user facing ones," Gamblin added. "Different teams have different needs and so we see demand for more and more cloud services on-prem, and Oxide Racks provides that in a way that we could potentially make user-facing one day."
In this respect, Oxide's approach to software is just as interesting to LLNL as its hardware, if not more so, as the lab looks to address its changing needs.
The goal is to begin introducing teams to more of an API-driven approach to automating, deploying, and managing virtualized services. This approach, Gamblin explained, also gives LLNL a more flexible way of siloing and isolating users within the rack.
"The fact that we can give the user a silo within the rack. They get their own API endpoint for their project, and they can coordinate the resources that way is really powerful," he said. "Obviously the Oxide Rack doesn't have GPUs, but we see it as sort of a prototype to get both the admins and the users used to dealing with this kind of pure IaaS infrastructure."
While not currently part of Oxide's compute range, GPUs are something the company is exploring. "We just need to find the right substrate in that hardware acceleration package that allows us to kind of deliver that holistic value, not just repackaging a GPU for the sake of repackaging a GPU," Oxide CEO Steve Tuck said.
But even if Oxide's hardware can't replace HPE's Cray EX cabinets just yet, Gamblin is already thinking about how to apply the same kind of virtualization, abstraction, and automation to large-scale HPC clusters.
"If you look at the way that we provision the center right now, we have a lot of network zones that are fairly rigid," he said.
That's because, to maintain security, clusters in one zone can't talk to those in another. So if someone has a job that could take advantage of idle compute in a different zone, it simply can't use it.
"If we had a fully virtualized system — the Oxide rack, is a way of prototyping this on the infrastructure side — we could run the workloads for both of those zones on the same rack, and we could essentially allow the zones themselves to be elastic," Gamblin said. "We see this as a prototype of how we would like the center to function in the future."
Helping with this strategy is the fact that, while Oxide's hardware may be nonstandard, its software is quite open. Built on the Unix-derived illumos operating system and the bhyve hypervisor, Oxide's virtualization and management stack is deeply integrated with its hardware.
"The fact that the Oxide stack is open source lets us think about that a whole lot more deeply and think about how the integration would progress over time," Gamblin said.
Oxide's rackscale compute platform won't be limited to LLNL either. The lab plans to open the system up to researchers at Los Alamos and Sandia National Labs.
"We see it as a way for those other labs to have users running services, and we're interested in seeing Oxide rack scale out to multi rack configurations. It's important to us for disaster recovery and also for running multi-site," Gamblin said. ®