AMD’s AI strategy comes into view with Xilinx, GPU, software plans
Chip designer hopes to have broad inference and training coverage from the edge to the cloud
Analysis After re-establishing itself in the datacenter over the past few years, AMD is now hoping to become a big player in the AI compute space with an expanded portfolio of chips that cover everything from the edge to the cloud.
It's quite an ambitious goal, given Nvidia's dominance in the space with its GPUs and the CUDA programming model, plus the increasing competition from Intel and several other companies.
But as executives laid out during AMD's Financial Analyst Day 2022 event last week, the resurgent chip designer believes it has the right silicon and software coming into place to pursue the wider AI space.
"Our vision here is to provide a broad technology roadmap across training and inference that touches cloud, edge and endpoint, and we can do that because we have exposure to all of those markets and all of those products," AMD CEO Lisa Su said in her opening remarks at the event.
Su admitted that it will take "a lot of work" for AMD to catch up in the AI space, but she said the market represents the company's "single highest growth opportunity."
Expanding off early traction with CPUs in AI inference
At last week's event, AMD executives said they have started to see some early traction in the AI compute market with the company's Epyc server chips being used for inference applications and its Instinct datacenter GPUs being deployed for AI model training.
For instance, multiple cloud service providers are already using AMD's software optimizations via its ZenDNN library to provide a "very nice performance uplift" on recommendation engines using the company's Epyc CPUs, according to Dan McNamara, the head of AMD's Epyc business.
Short for Zen Deep Neural Network, ZenDNN is integrated with the popular TensorFlow and PyTorch frameworks as well as ONNXRT, and it's supported by the second and third generation of Epyc chips.
"I think it's really important to say that a large percentage of the inference is happening in CPUs, and we expect that to continue going forward," he said.
In the near future, AMD is looking to introduce more AI capabilities into CPUs at the hardware level.
This includes the AVX-512 VNNI instruction, which will be introduced to accelerate neural network processing in the next-generation Epyc chips, code-named Genoa, coming out later this year.
Because this capability is being implemented in Genoa's Zen 4 architecture, VNNI will also be present in the company's Ryzen 7000 desktop chips, which are due by the end of the year.
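To make the VNNI idea concrete, here is a minimal Python sketch of the arithmetic that a VNNI-style byte dot-product instruction performs per 32-bit lane: four unsigned 8-bit values are multiplied by four signed 8-bit values and the products are added to a 32-bit accumulator. The function names `vnni_lane` and `int8_dot` are hypothetical and chosen for illustration. This emulates the math only and says nothing about how AMD implements it in Zen 4.

```python
from typing import Sequence

INT32_MASK = 0xFFFFFFFF


def vnni_lane(acc: int, unsigned_bytes: Sequence[int], signed_bytes: Sequence[int]) -> int:
    """Emulate one 32-bit lane: acc += sum of four (u8 * s8) products.

    Hypothetical helper for illustration; real hardware does this for many
    lanes at once in a single instruction.
    """
    if len(unsigned_bytes) != 4 or len(signed_bytes) != 4:
        raise ValueError("each lane consumes exactly four byte pairs")
    total = acc
    for u, s in zip(unsigned_bytes, signed_bytes):
        if not 0 <= u <= 255:
            raise ValueError("first operand must fit in an unsigned byte")
        if not -128 <= s <= 127:
            raise ValueError("second operand must fit in a signed byte")
        total += u * s
    # Assumption: model the non-saturating form, which wraps like
    # ordinary 32-bit signed arithmetic.
    total &= INT32_MASK
    return total - (1 << 32) if total >= (1 << 31) else total


def int8_dot(activations: Sequence[int], weights: Sequence[int]) -> int:
    """Dot product of quantized vectors, consumed four elements at a time."""
    if len(activations) != len(weights):
        raise ValueError("vectors must have the same length")
    pad = (-len(activations)) % 4  # zero-pad to a multiple of four
    a = list(activations) + [0] * pad
    w = list(weights) + [0] * pad
    acc = 0
    for i in range(0, len(a), 4):
        acc = vnni_lane(acc, a[i:i + 4], w[i:i + 4])
    return acc


if __name__ == "__main__":
    print(int8_dot([1, 2, 3, 4, 5, 6, 7, 8], [1, -1, 1, -1, 2, 2, 2, 2]))
```

Quantized inference spends most of its time in dot products like this one, which is why folding the multiply and accumulate steps into one instruction helps neural network workloads on CPUs.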
AMD plans to expand the AI capabilities of its CPUs further by making use of the AI engine technology from its $49 billion acquisition of FPGA designer Xilinx, which closed earlier this year.
The AI engine, which falls under AMD's newly named XDNA banner of "adaptive architecture" building blocks, will be incorporated into several new products across the company's portfolio in the future.
After making its debut in Xilinx's Versal adaptive chip in 2018, the AI engine will be integrated in two future generations of Ryzen laptop chips. The first is code-named Phoenix Point and will arrive in 2023 while the second is code-named Strix Point and will arrive in 2024. The AI engine will also be used in a future generation of Epyc server chips, though AMD didn't say when that would happen.
In 2024, AMD expects to debut the first chips using its next-generation Zen 5 architecture, which will include new optimizations for AI and machine learning workloads.
Big AI training ambitions with 'first datacenter APU'
As for GPUs, AMD has made some headway in the AI training space with its most recent generation of Instinct GPUs, the MI200 series, and it's hoping to make even more progress in the near future with new silicon and software improvements.
For instance, in the latest version of its ROCm GPU compute software, AMD has added optimizations for training and inference workloads running on frameworks like PyTorch and TensorFlow.
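For readers wondering what this looks like from the developer's side, the sketch below shows one way a script might check whether it is running on a ROCm build of PyTorch. It relies on two behaviors of PyTorch's ROCm builds: they reuse the `torch.cuda` API and the `"cuda"` device string, and they set `torch.version.hip`. The function name `describe_accelerator` and the shape of the returned dictionary are assumptions made for illustration, and the code falls back to CPU if PyTorch is not installed.

```python
def describe_accelerator() -> dict:
    """Report which compute backend PyTorch would use on this machine.

    Hypothetical helper for illustration, not part of ROCm or PyTorch.
    """
    try:
        import torch
    except ImportError:
        # No PyTorch available: nothing to probe, so report CPU.
        return {"device": "cpu", "backend": "none"}

    # On ROCm builds torch.version.hip is a version string; otherwise None.
    hip_version = getattr(torch.version, "hip", None)

    if torch.cuda.is_available():
        # ROCm builds expose AMD GPUs through the same "cuda" device name.
        backend = "rocm" if hip_version else "cuda"
        return {"device": "cuda", "backend": backend}

    return {"device": "cpu", "backend": "none"}


if __name__ == "__main__":
    print(describe_accelerator())
```

Because the device string stays the same, model code written as `model.to("cuda")` generally does not need to change to target an Instinct or Radeon GPU under ROCm.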
The company has also expanded ROCm support to its consumer-focused Radeon GPUs that use the RDNA architecture, according to David Wang, the head of AMD's GPU business.
"And lastly, we're developing SDKs with pre-optimized models to ease the development and deployment of AI applications," he said.
To drive adoption of its GPUs for AI purposes, Wang said AMD has developed "deep partnerships with some of the key leaders in the industry," including Microsoft and Facebook parent company Meta.
"We have optimized ROCm for PyTorch to deliver amazing, very, very competitive performance for their internal AI workloads as well as the jointly developed open-source benchmarks," he said.
Moving forward, AMD hopes to become even more competitive in the AI training space with the Instinct MI300, which it is calling the "world's first datacenter APU" as the chip combines a Zen 4-based Epyc CPU with a GPU that uses the company's new CDNA 3 architecture.
AMD claims the Instinct MI300 will deliver a greater than 8x boost in AI training performance over the Instinct MI250X chip currently on the market.
"The MI300 is a truly amazing part, and we believe it points the direction of the future of acceleration," said Forrest Norrod, head of AMD's Datacenter Solutions Business Group.
Using Xilinx to expand to the edge and improve software
While AMD plans to use Xilinx's tech in future CPUs, the chip designer made it clear that the acquisition will also help the company cover a wider range of opportunities in the AI space and harden its software offerings. The latter is critical if AMD wants to better compete with Nvidia and others.
This was laid out by Victor Peng, Xilinx's former CEO who is now head of AMD's Adaptive and Embedded Group, which leads development for all the FPGA-based products from Xilinx's portfolio.
Before the Xilinx acquisition was completed earlier this year, AMD's coverage in the AI compute space was mainly in cloud datacenters with its Epyc and Instinct chips, in enterprises with its Epyc and Ryzen Pro chips, and in homes with its Ryzen and Radeon chips.
But with Xilinx's portfolio now under the AMD banner, the chip designer has much broader coverage in the AI market. This is because Xilinx's Zynq adaptive chips are used in a variety of industries, including health care and life sciences, transportation, smart retail, smart cities, and intelligent factories. Xilinx's Versal adaptive chips, on the other hand, are used by telecommunications providers. Xilinx also has Alveo accelerators and Kintex FPGAs that are used in cloud datacenters.
With the Xilinx acquisition, AMD's products cover several industries in the AI compute space.
"We're actually in quite a bit of areas that are doing AI, mostly the inference, but, again, the heavy-duty training is happening in the cloud," Peng said.
AMD views the Xilinx products as "very complementary" with its portfolio of CPUs and GPUs. As such, the company is targeting its combined offerings for a wide spectrum of AI application needs:
- Ryzen and Epyc CPUs, including future Ryzen CPUs with the AI engine, will cover small to medium models for training and inference
- Epyc CPUs with the AI engine, Radeon GPUs and Versal chips will cover medium to large models for training and inference
- Instinct GPUs and Xilinx's adaptive chips will cover very large models for training and inference
"Once we start integrating AI into more of our products and we go to the next generation, we cover a tremendous more of the space across the models," Peng said.
How AMD sees CPUs, GPUs and adaptive chips covering different parts of the AI spectrum.
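The three coverage tiers listed above can be restated as a small lookup table. The sketch below is only a rearrangement of the article's list: the `ModelSize` buckets, the `COVERAGE` table and the `candidates` function are hypothetical names, and splitting "small to medium" and "medium to large" into discrete size labels is an assumption made for illustration.

```python
from enum import Enum
from typing import List


class ModelSize(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"
    VERY_LARGE = "very large"


# Each entry: (model sizes covered, product families named for that tier).
COVERAGE = [
    (
        {ModelSize.SMALL, ModelSize.MEDIUM},
        ["Ryzen CPUs", "Epyc CPUs", "future Ryzen CPUs with the AI engine"],
    ),
    (
        {ModelSize.MEDIUM, ModelSize.LARGE},
        ["Epyc CPUs with the AI engine", "Radeon GPUs", "Versal chips"],
    ),
    (
        {ModelSize.VERY_LARGE},
        ["Instinct GPUs", "Xilinx adaptive chips"],
    ),
]


def candidates(size: ModelSize) -> List[str]:
    """Return every product family whose tier includes the given model size."""
    result: List[str] = []
    for sizes, products in COVERAGE:
        if size in sizes:
            result.extend(products)
    return result


if __name__ == "__main__":
    for size in ModelSize:
        print(size.value, "->", ", ".join(candidates(size)))
```

Written this way, the overlap becomes visible: medium-sized models fall into two tiers, so they are the ones with the widest choice of AMD silicon.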
But if AMD wants broader industry adoption of its chips for AI purposes, the company will need to ensure that developers can easily program their applications across this menagerie of silicon.
That's why the chip designer plans to consolidate previously disparate software stacks for CPUs, GPUs and adaptive chips into one interface, which it's calling the AMD Unified AI Stack. The first version will bring together AMD's ROCm software for GPU programming, its CPU software and Xilinx's Vitis AI software to provide unified development and deployment tools for inference workloads.
Peng said the Unified AI Stack will be an ongoing development effort for AMD, which means the company plans to consolidate even more software components in the future, so that, for instance, developers only have to use one machine learning graph compiler for any chip type.
"Now people can, in the same development environment, hit any one of these target architectures. And in the next generation, we're going to unify even more of the middleware," he said.
While AMD has laid out a very ambitious strategy for AI compute, it will no doubt require a lot of heavy lifting and doing right by developers for such a strategy to work. ®