Security


Nvidia DGX systems prone to side channel, covert attacks

Reverse engineering yields sticky microarchitectural vulnerabilities

Thu 31 Mar 2022 // 13:43 UTC

Nvidia's ultra-dense GPU-driven AI training and inference systems are prone to covert and side channel attacks, according to research just published by a team led by Pacific Northwest National Laboratory (PNNL). This might be less concerning for those with on-prem DGX systems, but for cloud vendors selling time on the AI training boxes, the vulnerabilities are worth noting.

Let's start with the good news: the problems are most pressing for DGX machines built on pre-Ampere GPU generations and, luckily, the major cloud operators have switched to Ampere-generation DGX machines. The bad news? Owners of Pascal- and Volta-based DGX boxes, read on.

Unlike more brute-force ways of compromising a system, the vulnerabilities cited by the PNNL-led group focus on microarchitectural gaps [PDF]. These can affect both on-prem and remotely hosted systems. The team executed a proof of concept attack to demonstrate the issues.

Specifically, the team reverse-engineered the cache hierarchy, showing how an attacker running on one GPU can reach the L2 cache of a connected GPU (the accelerators are hooked together with Nvidia's proprietary NVLink) and create contention there.

They also developed a "prime and probe attack on a remote GPU allowing an attacker to recover the cache hit and miss behavior of another workload."
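For readers unfamiliar with the technique, here is a minimal sketch of prime and probe against a toy set-associative cache with LRU replacement. It is a software simulation only: `ToyCache`, its sizes, and the addresses are illustrative assumptions, not the geometry of Nvidia's L2 or the researchers' code. A real attack infers hits and misses from access timing rather than reading them directly.

```python
from collections import OrderedDict


class ToyCache:
    """Toy set-associative LRU cache (hypothetical sizes, not Nvidia's L2)."""

    def __init__(self, num_sets=4, ways=2, line_size=64):
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        """Return True on a hit, False on a miss, and update LRU order."""
        line = addr // self.line_size
        cache_set = self.sets[line % self.num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)
            return True
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)  # evict the least recently used line
        cache_set[line] = True
        return False


def eviction_set(cache, target_set, base=0x10000):
    """Attacker-owned addresses that all map to target_set, one per way."""
    stride = cache.line_size * cache.num_sets
    first = base + target_set * cache.line_size
    return [first + i * stride for i in range(cache.ways)]


def prime_and_probe(cache, target_set, victim):
    """Return how many attacker lines the victim evicted from target_set."""
    addrs = eviction_set(cache, target_set)
    for addr in addrs:  # prime: fill the set with attacker lines
        cache.access(addr)
    victim(cache)  # the victim workload runs in between
    # probe in reverse order so the probe itself does not evict
    # attacker lines that are still cached
    return sum(not cache.access(addr) for addr in reversed(addrs))


def busy_victim(cache):
    cache.access(0x2000 + 2 * 64)  # one access that lands in set 2


def idle_victim(cache):
    pass


print(prime_and_probe(ToyCache(), 2, busy_victim))  # 1
print(prime_and_probe(ToyCache(), 2, idle_victim))  # 0
```

A non-zero miss count tells the attacker that the victim touched the monitored set, without the attacker ever reading the victim's data.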

In reverse engineering the caches and poking around the shared Non-Uniform Memory Access (NUMA) configuration the team found "the L2 cache on each GPU caches the data for any memory pages mapped to that GPU's physical memory (even from a remote GPU)."

They add that "this observation enables us to create contention on remote caches by allocating memory on the target GPU, which is the essential ingredient enabling our covert and side channels. Specifically, we develop the first microarchitectural covert and side-channel attacks across GPUs in a multi-GPU servers (an Nvidia DGX-1 server)."
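The covert channel idea can be sketched the same way. In this toy model a single shared cache set stands in for a set in the target GPU's L2, the sender signals a 1 by thrashing that set and a 0 by staying idle, and the receiver decodes each bit from whether its own lines were evicted. The names `send_bit`, `receive_bit`, and `transmit`, and the associativity `WAYS`, are assumptions for illustration. The real attack places the two processes on different GPUs and decodes bits from access latency.

```python
from collections import OrderedDict

WAYS = 4  # assumed associativity of the single shared set modelled here


def touch(cache_set, line):
    """Access a line in an LRU set; return True on a hit."""
    if line in cache_set:
        cache_set.move_to_end(line)
        return True
    if len(cache_set) >= WAYS:
        cache_set.popitem(last=False)  # evict the least recently used line
    cache_set[line] = True
    return False


def send_bit(cache_set, bit):
    """Sender: fill the set with its own lines for a 1, do nothing for a 0."""
    if bit:
        for i in range(WAYS):
            touch(cache_set, ("sender", i))


def receive_bit(cache_set):
    """Receiver: probe its own lines; any miss means the sender was active."""
    misses = sum(
        not touch(cache_set, ("receiver", i)) for i in reversed(range(WAYS))
    )
    return 1 if misses > 0 else 0


def transmit(bits):
    """Send bits through the shared set and return what the receiver decodes."""
    cache_set = OrderedDict()
    for i in range(WAYS):  # initial prime by the receiver
        touch(cache_set, ("receiver", i))
    received = []
    for bit in bits:
        send_bit(cache_set, bit)
        # probing reloads the receiver's lines, so it also re-primes the set
        received.append(receive_bit(cache_set))
    return received


message = [1, 0, 1, 1, 0, 0, 1]
print(transmit(message) == message)  # True
```

The toy channel is noise-free; on real hardware other workloads also cause evictions, which is why the quality of the timing measurement matters.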

Aside from the obvious, especially in the cloud case, these vulnerabilities are noteworthy because they are likely going to be difficult to pinpoint.

As the team notes, because the attacks target a tightly interconnected multi-GPU system rather than a single GPU on a node, attackers don't need to manipulate the scheduler for one GPU to hook into the victim's kernel.

They also "bypass isolation-based defenses, such as partition-based defense mechanisms that can be enabled for processes running within a single GPU," the team explains, adding:

The attacks we develop are first Prime+Probe based timing attacks on L2 cache on GPUs. Our attacks extract contention information at the granularity of a single cache set, providing high-resolution attacks with fine-grained access time measurements, reducing the noise, and achieving high quality channels. The attacks are conducted entirely from the user level without any special access (e.g. huge pages or flush instruction). As a result, we believe this attack model challenges assumptions from prior GPU based attacks and significantly expands our understanding of the threat model in Multi-GPU servers.

With all of this said, there are mitigations, including static or dynamic partitioning of shared resources. This is easier with the newest Nvidia A100 GPU-based DGX machines, which have partitioning built in. In essence, each GPU can be split into discrete, isolated GPU instances in multi-user environments, each with its own direct and isolated paths through the cache and memory.
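Why partitioning blunts the attack can be shown with another toy model. The sketch below assumes a cache whose sets are divided statically between tenants, so an attacker and a victim never map to the same set. `PartitionedCache` and `probe_misses` are hypothetical names, and the model is not a description of how Nvidia implements GPU instances.

```python
from collections import OrderedDict


class PartitionedCache:
    """Toy LRU cache whose sets are split evenly between partitions."""

    def __init__(self, num_sets=8, ways=2, partitions=2):
        self.ways = ways
        self.partitions = partitions
        self.sets_per_part = num_sets // partitions
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, tenant, line):
        """Return True on a hit; a tenant can only reach its own partition."""
        part = tenant % self.partitions
        index = part * self.sets_per_part + line % self.sets_per_part
        cache_set = self.sets[index]
        key = (tenant, line)
        if key in cache_set:
            cache_set.move_to_end(key)
            return True
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)  # evict the least recently used line
        cache_set[key] = True
        return False


def probe_misses(partitions):
    """Attacker (tenant 0) primes a set, victim (tenant 1) runs, attacker probes."""
    cache = PartitionedCache(partitions=partitions)
    # lines that all fall in the first set of a tenant's partition
    lines = [k * cache.sets_per_part for k in range(cache.ways)]
    for line in lines:
        cache.access(0, line)  # prime
    for line in lines:
        cache.access(1, line)  # victim activity
    return sum(not cache.access(0, line) for line in reversed(lines))


print(probe_misses(partitions=1))  # shared cache: 2 misses, activity leaks
print(probe_misses(partitions=2))  # partitioned: 0 misses, nothing observed
```

With one shared partition the victim's accesses evict the attacker's lines; with two partitions the attacker's probe sees only hits whatever the victim does. The cost is that each tenant gets half the cache.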

The team also proposes partitioning mechanisms, though these carry some performance overhead. "Although inherent GPU-to-GPU communications cannot be completely eliminated in multi-GPU systems, making these cross-GPU data transfers more coarse-grained in normal applications will significantly increase the detection accuracy of high-bandwidth attacks, leading to more efficient defenses."

"Our work establishes for the first time the vulnerability of these machines to microarchitectural attacks, and we hope that it guides future research to improve their security," the team adds.
