Meta shares latest hardware – you can't wear it on your face, so don't panic
More capable kit ready for more demanding machine learning tasks
At the 2022 Open Compute Project (OCP) Global Summit on Tuesday, Meta introduced its second-generation GPU-powered datacenter hardware for machine learning training and inference – a system called Grand Teton.
"We're excited to announce Grand Teton, our next-generation platform for AI at scale that we'll contribute to the OCP community," said Alexis Björlin, VP of engineering at Meta, in a note to The Register. "As with other technologies, we’ve been diligently bringing AI platforms to the OCP community for many years and look forward to continued partnership."
Tuned for fast processing of large-scale AI workloads in datacenters, Grand Teton boasts numerous improvements over its predecessor Zion, such as 4x the host-to-GPU bandwidth, 2x the compute and data network bandwidth, and 2x the power envelope.
Where the Zion-EX platform consisted of multiple connected subsystems, Grand Teton unifies those components in a single hardware chassis.
According to Björlin, Zion-EX consists of a CPU head node, a switch sync system, and a GPU system, all linked via external cabling. Grand Teton is a single box with integrated power, compute, and fabric interfaces, resulting in better overall performance, signal integrity, and thermal behavior. The design supposedly makes datacenter integration easier and enhances reliability.
Grand Teton has been engineered to better handle memory-bandwidth-bound workloads like deep learning recommendation models (DLRMs), which can require a zettaflop of compute power just to train. It's also optimized for compute-bound workloads like content understanding.
In the hope that someone wants to view its datacenter blueprints using the VR goggles it sells, Meta has created a website to host 3D models of its hardware designs, metainfrahardware.com. The biz is focused on pushing the metaverse, a galaxy of interconnected virtual-reality worlds, accessible using VR headsets.
OCP was founded in 2011 by Facebook, which reorganized last year under a parent company called Meta that comes without the scandal baggage. OCP aims to allow large consumers of computing power to share hardware designs for datacenter servers and related equipment optimized for enterprise and hyperscale work. It essentially allowed Facebook, Google, and other cloud operators to specify exactly the boxes they wanted, and have contract manufacturers turn them out on demand, rather than have server vendors dictate the designs. The project has since widened its community.
That means OCP is a collection of open specifications, best practices, and other resources that people can follow or tap into if they want to build out interoperable gear or take inspiration from the cloud giants. The contributed designs are useful for seeing where the big players are headed in terms of their datacenter needs, and what design decisions are being taken to achieve the scale they want.
OCP's market impact has been fairly modest: companies spent more than $16 billion on OCP kit in 2020 and that figure is projected to reach $46 billion by 2025. The total datacenter infrastructure market is expected to be about $230 billion in 2025.
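For a rough sense of scale, the share and the implied growth rate can be worked out from those three figures. The sketch below is illustrative arithmetic only: the dollar amounts are the ones quoted above, while the function names and the assumption of smooth compound growth between 2020 and 2025 are ours, not the analysts'.

```python
def market_share(segment_usd_bn: float, total_usd_bn: float) -> float:
    """Fraction of the total market taken by one segment."""
    return segment_usd_bn / total_usd_bn


def implied_cagr(start_usd_bn: float, end_usd_bn: float, years: int) -> float:
    """Compound annual growth rate, assuming smooth year-on-year growth."""
    return (end_usd_bn / start_usd_bn) ** (1 / years) - 1


# Figures quoted in the article, in billions of US dollars.
OCP_2020 = 16.0     # "more than $16 billion", so treated here as a floor
OCP_2025 = 46.0     # projected OCP spending
TOTAL_2025 = 230.0  # projected total datacenter infrastructure market

share_2025 = market_share(OCP_2025, TOTAL_2025)
growth = implied_cagr(OCP_2020, OCP_2025, years=2025 - 2020)

print(f"OCP share of the 2025 market: {share_2025:.0%}")
print(f"Implied annual growth, 2020 to 2025: {growth:.1%}")
```

On those numbers, OCP gear would make up a fifth of the projected 2025 market, with spending growing at somewhere between 23 and 24 percent a year. Because the 2020 figure is a floor, the true growth rate would be a little lower.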
Meta is also talking up Open Rack v3 (ORV3), the latest iteration of its common rack and power architecture, which aims to make deploying and servicing rack-mounted IT gear easier. ORV3 features a power shelf that can be installed anywhere in the rack.
"Multiple shelves can be installed on a single busbar to support 30kW racks, while 48VDC output will support higher power transmission needs in the future," said Björlin in a blog post due to go live today. "It also features an improved battery backup unit, upping the capacity to four minutes, compared with the previous model's 90 seconds, and with a power capacity of 15kW per shelf."
ORV3 has been designed to accommodate assorted liquid cooling strategies, such as air-assisted liquid cooling and facility water cooling.
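The battery backup figures Björlin quotes for ORV3 can be put in perspective with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that a shelf delivers its full 15kW for the whole four minutes and that there are no conversion losses; the constant and function names are hypothetical and do not come from the ORV3 specification.

```python
def backup_energy_kwh(power_kw: float, runtime_s: float) -> float:
    """Energy delivered at constant power over the runtime, in kWh."""
    return power_kw * runtime_s / 3600.0


SHELF_POWER_KW = 15.0   # quoted per-shelf power capacity
NEW_RUNTIME_S = 4 * 60  # quoted: four minutes
OLD_RUNTIME_S = 90      # quoted: the previous model's 90 seconds

# Assumption: full load for the entire runtime, no losses.
energy = backup_energy_kwh(SHELF_POWER_KW, NEW_RUNTIME_S)
runtime_ratio = NEW_RUNTIME_S / OLD_RUNTIME_S

print(f"Energy per shelf at full load: {energy:.2f} kWh")
print(f"Runtime improvement: {runtime_ratio:.2f}x")
```

Under those assumptions a shelf would need to hold about one kilowatt-hour of usable energy, and the backup window is a little under three times as long as before.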
"The power trend increases we're seeing, and the need for liquid cooling advances, are forcing us to think differently about all elements of our platform, rack and power, and data center design," explained Björlin. ®