VMware claims 'bare-metal' performance on virtualized GPUs

Is... is that why Broadcom wants to buy it?


The future of high-performance computing will be virtualized, VMware's Uday Kurkure has told The Register.

Kurkure, the lead engineer for VMware's performance engineering team, has spent the past five years working on ways to virtualize machine-learning workloads running on accelerators. Earlier this month his team reported "near or better than bare-metal performance" for Bidirectional Encoder Representations from Transformers (BERT) and Mask R-CNN — two popular machine-learning workloads — running on virtualized GPUs (vGPU) connected using Nvidia's NVLink interconnect.

NVLink enables compute and memory resources to be shared across up to four GPUs over a high-bandwidth mesh fabric operating at 6.25GB/s per lane, compared with roughly 2GB/s per lane for PCIe 4.0. The interconnect enabled Kurkure's team to pool 160GB of GPU memory from the Dell PowerEdge system's four 40GB Nvidia A100 SXM GPUs.
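
As a rough illustration of that pooling arithmetic, the sketch below estimates how many 40GB GPUs a given memory footprint would need and whether it fits on a four-GPU host. The footprint values are hypothetical, not figures from VMware's tests, and the calculation ignores real-world factors such as framework overhead and memory fragmentation.

```python
import math

GPU_MEMORY_GB = 40   # per-GPU memory of the A100s in the test system
GPUS_PER_HOST = 4    # GPUs linked by NVLink in the test system


def gpus_needed(footprint_gb: float, gpu_memory_gb: float = GPU_MEMORY_GB) -> int:
    """Return the minimum number of GPUs whose pooled memory holds the footprint."""
    if footprint_gb <= 0:
        raise ValueError("footprint must be positive")
    return math.ceil(footprint_gb / gpu_memory_gb)


def fits_on_host(footprint_gb: float) -> bool:
    """True if the footprint fits in the host's pooled GPU memory."""
    return gpus_needed(footprint_gb) <= GPUS_PER_HOST


if __name__ == "__main__":
    # Hypothetical footprints in GB, for illustration only.
    for footprint in (30, 100, 160, 200):
        print(footprint, gpus_needed(footprint), fits_on_host(footprint))
```

Under these assumptions, anything above 160GB no longer fits on a single host of this kind.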

"As the machine learning models get bigger and bigger, they don't fit into the graphics memory of a single chip, so you need to use multiple GPUs," he explained.

Support for NVLink in VMware's vSphere is a relatively new addition. By toggling NVLink on and off in vSphere between tests, Kurkure was able to determine how large an impact the interconnect had on performance.

And in what should be a surprise to no one, the large ML workloads ran faster, scaling linearly with additional GPUs, when NVLink was enabled.

Testing showed Mask R-CNN training running 15 percent faster in a twin-GPU NVLink configuration, and 18 percent faster when using all four A100s. The performance delta was even greater in the BERT natural language processing model, where the NVLink-enabled system performed 243 percent faster when running on all four GPUs.

What's more, Kurkure says the virtualized GPUs were able to achieve the same or better performance compared to running the same workloads on bare metal.

"Now with NVLink being supported in vSphere, customers have the flexibility where they can combine multiple GPUs on the same host using NVLink so they can support bigger models, without a significant communication overhead," Kurkure said.

HPC, enterprise implications

Based on the results of these tests, Kurkure expects most HPC workloads will be virtualized moving forward. The HPC community is always running into performance bottlenecks that leave systems underutilized, he added, arguing that virtualization enables users to make much more efficient use of their systems.

Kurkure's team was able to achieve performance comparable to bare metal while using just a fraction of the dual-socket system's CPU resources.
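
Kurkure put that fraction at 16 of the system's 128 logical cores. A minimal sketch of the headroom arithmetic follows; the function name is ours, chosen purely for illustration.

```python
from typing import Tuple


def cpu_headroom(used_cores: int, total_cores: int) -> Tuple[int, float]:
    """Return (idle cores, fraction of cores in use)."""
    if total_cores <= 0 or not 0 <= used_cores <= total_cores:
        raise ValueError("need 0 <= used_cores <= total_cores and total_cores > 0")
    return total_cores - used_cores, used_cores / total_cores


idle, used_fraction = cpu_headroom(16, 128)
print(f"{idle} logical cores idle, {used_fraction:.1%} in use")
# prints: 112 logical cores idle, 12.5% in use
```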

"We were only using 16 logical cores out of 128 available," he said. "You could use that CPU resources for other jobs without affecting your machine-learning intensive graphics modules. This is going to improve your utilization, and bring down the cost of your datacenter."

Toggling NVLink between GPUs on and off also adds platform flexibility, allowing multiple isolated AI/ML workloads to be spread across the GPUs simultaneously.

"One of the key takeaways of this testing was that because of the improved utilization offered by vGPUs connected over a NVLink mesh network, VMware was able to achieve bare-metal-like performance while freeing idle resources for other workloads," Kurkure said.

VMware expects these results to improve resource utilization in several applications, including investment banking, pharmaceutical research, 3D CAD, and auto manufacturing. 3D CAD is a particularly high-demand area for HPC virtualization, according to Kurkure, who cited several customers looking to implement machine learning to assist with the design process.

And while it's possible to run many of these workloads on GPUs in the cloud, he argued that cost and/or intellectual property rules may prevent customers from doing so.

vGPU vs MIG

It's worth noting that VMware's tests were conducted using Nvidia's vGPU Manager in vSphere, as opposed to the hardware-level partitioning offered by multi-instance GPU (MIG) on the A100. MIG essentially allows the A100 to behave like up to seven less-powerful GPUs.

By comparison, vGPUs are defined in the hypervisor and are time-sliced. You can think of this as multitasking where the GPU rapidly cycles through each vGPU workload until they're completed.
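
To make the multitasking analogy concrete, here is a toy round-robin simulation. It is only a sketch of time-slicing in general, not a model of Nvidia's actual vGPU scheduler; the workload names and time units are invented for illustration.

```python
from collections import deque
from typing import Dict, List


def time_slice(workloads: Dict[str, int], slice_units: int = 1) -> List[str]:
    """Run workloads round-robin and return the order in which they finish.

    workloads maps a (hypothetical) vGPU name to the units of GPU time it needs.
    """
    if slice_units <= 0:
        raise ValueError("slice_units must be positive")
    queue = deque(workloads.items())
    finished: List[str] = []
    while queue:
        name, remaining = queue.popleft()
        remaining -= slice_units          # the GPU works on this vGPU for one slice
        if remaining > 0:
            queue.append((name, remaining))  # not done: back of the queue
        else:
            finished.append(name)
    return finished


print(time_slice({"vgpu-a": 3, "vgpu-b": 1, "vgpu-c": 2}))
# prints: ['vgpu-b', 'vgpu-c', 'vgpu-a']
```

Shorter jobs finish first even though all three share one GPU, which is the behavior the analogy describes.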

The benefit of vGPUs is users can scale well beyond seven GPU instances at the cost of potential overheads associated with rapid context switching, Kurkure explained. However, at least in his testing, the use of vGPUs didn't appear to have a negative impact on performance compared to running on bare metal with the GPUs passed through to the VM.

Whether MIG would change this dynamic remains to be seen and is the subject of another ongoing investigation by Kurkure's team. "It's not clear when you should be using vGPU and when we should be running in MIG mode," he said.

More to come

With vGPUs over NVLink validated for scale-up workloads, VMware is now exploring options such as how these workloads scale across multiple systems and racks over RDMA over converged Ethernet (RoCE). Here, Kurkure says, networking becomes a major consideration.

"The natural extension of this is scale out," he said. "So, we'll have a number of hosts connected by RoCE."

Kurkure's team is also investigating how these architectures scale with even larger AI/ML models, like GPT-3, as well as how they can be applied to telco workloads running at the edge. ®

Biting the hand that feeds IT © 1998–2022