Nvidia: Better parallelism coming to standard C++ lib

Which could be useful for people writing code to run across CPUs, GPUs, etc


GTC Nvidia said the lines are blurring between standard C++ and its CUDA C++ library when it comes to parallel execution of code.

C++ itself is "starting to enable parallel algorithms and asynchronous execution as first-class components of the language," said Stephen Jones, CUDA architect at Nvidia, during a break-out session on CUDA at Nv's GPU Technology Conference (GTC) this week.

"I think [it's] by far the most exciting move for standard C++ in that direction," Jones added.

The C++ standards committee is developing an asynchronous programming abstraction built around senders and receivers, which can schedule work to run within generic execution contexts. A context might be a CPU thread doing mainly IO, or a CPU or GPU thread doing intensive computation; the management is not tied to specific hardware. "This is a framework for orchestrating parallel execution, writing your own portable parallel algorithms [with an] emphasis on portability," Jones said.

A paper proposing the design argues that it supplies the "standard vocabulary and framework for asynchrony and parallelism that C++ programmers desperately need." The draft lists, among its authors, Michael Garland, senior director of programming systems and applications at Nvidia.

The paper noted that "C++11’s intended exposure for asynchrony is inefficient, hard to use correctly, and severely lacking in genericity, making it unusable in many contexts. We introduced parallel algorithms to the C++ Standard Library in C++17, and while they are an excellent start, they are all inherently synchronous and not composable."

Senders and receivers are a unifying point for running workloads across a range of targets and programming models, and are designed for heterogeneous systems, Jones said.

"The idea with senders and receivers is that you can express execution dependencies and compose together asynchronous task graphs in standard C++," Jones said. "I can target CPUs or GPUs, single thread, multi thread, even multi GPU."

This is all good news for Nvidia, as it should make it easier for people to write software to run across its GPUs, DPUs, CPUs, and other chips. Nvidia's CUDA C++ library, libcu++, which already provides a "heterogeneous implementation" of the standard C++ library, is available to HPC and CUDA developers.

At GTC, Nvidia emitted more than 60 updates to its libraries, including frameworks for quantum computing, 6G networks, robotics, cybersecurity, and drug discovery.

"With each new SDK, new science, new applications and new industries can tap into the power of Nvidia computing. These SDKs tackle the immense complexity at the intersection of computing algorithms and science," said CEO Jensen Huang during a keynote on Tuesday.

Amazing grace

Nvidia also introduced the Hopper H100 GPU, which Jones said had features to speed up processing by minimizing data movement and keeping information local.

"There's some profound new architectural features which change the way we program the GPU. It takes the asynchrony steps that we started making in the A100 and moves them forward," Jones said.

One such improvement is the jump to 132 streaming-multiprocessor (SM) units in the H100, up from 15 in the Kepler generation. "There's this ability to scale across SMs that is at the core of the CUDA programming model," Jones said.

There's another feature called the thread block cluster, in which multiple thread blocks operate concurrently across multiple SMs, exchanging data in a synchronized way. Jones called it a "block of blocks" with 16,384 concurrent threads in a cluster.

"By adding a cluster to the execution hierarchy, we are allowing an application to take advantage of faster local synchronization, faster memory sharing, all sorts of other good things like that," Jones said.

Another asynchronous execution feature is a new Tensor Memory Accelerator (TMA) unit, which the company says transfers large data blocks efficiently between global memory and shared memory, and asynchronously copies between thread blocks in a cluster.

Jones called TMA "a self-contained data movement engine" that is a separate hardware unit in the SM that runs independently of SM threads. "Instead of every thread in the block participating in the asynchronous memory copy, the TMA can take over and handle all the loops and address calculations for you," Jones said.

Nvidia has also added an asynchronous transaction barrier in which waiting threads can sleep until all other threads arrive, for atomic data transfer and synchronization purposes.

"You just say 'Wake me up when the data has arrived.' I can have my thread waiting ... expecting data from lots of different places and only wake up when it's all arrived," Jones said. "It's seven times faster than normal communication. I don't have all that back and forth. It's just a single write operation."

Nvidia has also streamlined the runtime compiler, the component that compiles code handed to CUDA while an application is running, to improve compilation speed.

"We streamline the internals of both the CUDA C++ and PTX compilers," Jones said, adding, "we've also made the runtime compiler multithreaded, which can halve the compilation time if you're using more CPU threads."

Also on the compiler front: support for C++20, which arrives in the upcoming CUDA 11.7 release.

"It's not yet going to be available on Microsoft Visual Studio; that's coming in the following release. But it means that you can use C++20 in both your host and your device code," Jones said. ®

