HPC

Petaflops beater: Nvidia chief talks exascale

Programming for parallel processes


"Power is now the limiter of every computing platform, from cellphones to PCs and even data centres," said NVIDIA chief executive Jen-Hsun Huang, speaking at the company's GPU Technology Conference in Beijing last week. There was much talk there about the path to exascale, a form of supercomputing that can execute 10^18 flop/s (floating-point operations per second).

Currently, the world's fastest supercomputer, Japan's K computer, achieves 10 petaflops (one petaflop = a thousand trillion floating point operations per second), just 1 per cent of exascale. The K computer consumes 12.66MW (megawatts), and Huang suggests that a realistic limit for a supercomputer is 20MW, which is why achieving exascale is a matter of power efficiency as well as size. At the other end of the scale, power efficiency determines whether your smartphone or tablet will last the day without a recharge, making this a key issue for everyone.

Huang's thesis is that the CPU, which is optimised for single-threaded execution, will not deliver the required efficiency. "With four cores, in order to execute an operation, a floating point add or a floating point multiply, 50 times more energy is dedicated to the scheduling of that operation than the operation itself," he says.

Power limits: NVIDIA chief executive Jen-Hsun Huang (photo: Tim Anderson)

"We believe the right approach is to use much more energy-efficient processors. Using much simpler processors and many of them, we can optimise for throughput. The unfortunate part is that this processor would no longer be good for single-threaded applications. By adding the two processors, the sequential code can run on the CPU, the parallel code can run on the GPU, and as a result you can get the benefit of both. We call it heterogeneous computing."

He would say that. NVIDIA makes GPUs, after all. But the message is being heard in the supercomputing world, where 39 of the top 500 systems use GPUs, up from 17 a year ago, including the number 2 supercomputer, China's Tianhe-1A. Thirty-five of those 39 systems use NVIDIA GPUs.

At a mere 2.57 petaflops though, Tianhe-1A is well behind the K computer, which does not use GPUs. Does that undermine Huang's thesis? "If you were to design the K computer with heterogeneous architecture, it would be even more," he insists. "At the time the K computer was conceived, almost 10 years ago, heterogeneous was not very popular."

Using GPUs for purposes other than driving a display is only practical because of changes made to the architecture to support general-purpose programming. NVIDIA's system is called CUDA and is programmed using CUDA C/C++. The latest CUDA compiler is based on LLVM, which makes it easier to add support for other languages. In addition, the company has just announced that it will release the compiler source code to researchers and tool vendors. "It's open source enough that anybody who would like to develop their target compiler can do it," says Huang.
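To give a flavour of the model, here is a minimal CUDA C sketch of the "many simple processors" approach Huang describes: a kernel launched across thousands of lightweight threads, one per array element. All names and sizes are illustrative, not taken from any NVIDIA material, and it needs NVIDIA hardware and the nvcc compiler to run.

```cuda
// Illustrative CUDA C sketch: element-wise vector addition.
#include <stdio.h>
#include <cuda_runtime.h>

// Each GPU thread computes exactly one output element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host-side buffers
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Device-side copies
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch enough blocks of 256 threads to cover all n elements
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %.1f\n", hc[0]);  // 1.0 + 2.0 = 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

The sequential orchestration (allocation, copies, launch) runs on the CPU while the parallel arithmetic runs on the GPU — the division of labour Huang calls heterogeneous computing.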

Another strand to programming the GPU is OpenACC, a set of directives you can add to C code that tell the compiler to transform it to parallelised code that runs on the GPU when available. "We've made it almost trivial for people with legacy applications that have large parallel loops to use directives to get a huge speedup," claims Huang.
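As a sketch of what such a directive looks like, assuming a SAXPY-style loop (the function and variable names are illustrative): one pragma marks the loop for parallel execution, and a compiler without OpenACC support simply ignores the unknown pragma and runs the loop serially, so the legacy code is untouched either way.

```c
#include <stdio.h>

#define N 1000000

/* OpenACC-style directive on a plain C loop (SAXPY: y = a*x + y).
   An OpenACC-aware compiler offloads the loop to the GPU; any other
   compiler ignores the pragma and runs the same loop serially. */
static void saxpy(int n, float a, const float x[], float y[]) {
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    static float x[N], y[N];
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 1.0f; }

    saxpy(N, 2.0f, x, y);           /* every element becomes 2*1 + 1 = 3 */
    printf("y[0] = %.1f\n", y[0]);  /* prints y[0] = 3.0 */
    return 0;
}
```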

OpenACC is not yet implemented, though it is based on an existing product from the Portland Group called PGI Accelerator. Cray and CAPS also plan to have OpenACC support in their compilers. These will require NVIDIA GPUs to get the full benefit, though it is a standard that others could implement. There is a programming standard called OpenCL that is already supported by multiple GPU vendors, but it is lower level and therefore less productive than CUDA or OpenACC.


Biting the hand that feeds IT © 1998–2022