We can unify HPC and AI software environments, just not at the source code level
Compute graphs are the way forward
Register Debate Welcome to the latest Register Debate in which writers discuss technology topics, and you the reader choose the winning argument. The format is simple: we propose a motion, the arguments for the motion will run this Monday and Wednesday, and the arguments against on Tuesday and Thursday. During the week you can cast your vote on which side you support using the poll embedded below, choosing whether you're in favour or against the motion. The final score will be announced on Friday, revealing whether the for or against argument was most popular.
This week's motion is: A unified, agnostic software environment can be achieved. We debate the question: can the industry ever have a truly open, unified, agnostic software environment in HPC and AI that can span multiple kinds of compute engines?
Arguing today FOR the motion is Rob Farber, a global technology consultant and author with an extensive background in HPC and in developing machine-learning technology that he applies at national laboratories and commercial organizations. Rob can be reached at info@techenablement.com.
The idea of a unified, agnostic software environment is an admirable goal, but one that is difficult to achieve at the source code level, because no programmer and no machine architecture – current or yet to be designed – can be left out.
To say anything meaningful – specifically, that there is a single “software environment” (note the air quotes) that will work for all codes – we have to look at how compilers use something called a compute graph. A compute graph is a directed graph in which the nodes correspond to operations or variables, and data flows through the graph to compute a desired result.
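To make the idea concrete, here is a minimal sketch in C++ of what such a graph looks like as a data structure. The Node type, its eval() traversal, and the (a*b)+a example are illustrative inventions for this article, not any particular compiler's internal representation:

    // Minimal illustrative compute graph: nodes are operations or variables,
    // edges carry data. Hypothetical structure, for illustration only.
    #include <cstdio>
    #include <functional>
    #include <vector>

    struct Node {
        std::vector<Node*> inputs;                             // incoming edges
        std::function<double(const std::vector<double>&)> op;  // operation at this node
        double value = 0.0;                                    // holds data for leaf nodes

        double eval() {                        // pull data through the graph
            std::vector<double> args;
            for (Node* in : inputs) args.push_back(in->eval());
            if (!inputs.empty()) value = op(args);
            return value;
        }
    };

    int main() {
        Node a, b, mul, add;                   // graph for (a * b) + a
        a.value = 3.0; b.value = 4.0;
        mul.inputs = {&a, &b};
        mul.op = [](const std::vector<double>& v) { return v[0] * v[1]; };
        add.inputs = {&mul, &a};
        add.op = [](const std::vector<double>& v) { return v[0] + v[1]; };
        std::printf("(a*b)+a = %g\n", add.eval()); // prints 15
        return 0;
    }

A production compiler does far more than this, of course – notably recognizing that node a is shared and evaluating it once, rather than naively re-traversing it as this sketch does.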
At the human level, the fundamental problem is that people write applications the way they want, in their language of choice. Specifying a single C/C++ API, for example, leaves Fortran programmers out. Meanwhile, high-level languages like Python try to hide the details of the underlying computer architecture and memory management from the programmer – superb for some applications and problematic for others.
Pragmas have been proposed as a solution because they let the compiler deal with the challenge of generating appropriate code for the destination architecture. Very simply, pragmas are not for everyone or every problem. A friend once described pragma programming as “negotiating with the compiler,” where the compiler has infinite patience and each release of the compiler may get it wrong.
Compounding the problem, pragmas have their own biases built in. The OpenACC standard, for example, does not provide protected regions of code – an omission made to safeguard GPU performance – which can be problematic when porting OpenMP CPU codes that rely on such regions.
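To see what that negotiation looks like in practice, consider the same loop annotated for both standards. This is a sketch only: the directives are standard OpenMP and OpenACC, but the compiler flags in the comments are examples that vary by toolchain, and how the loop actually maps to hardware is entirely up to the compiler:

    // The same loop expressed twice: once for OpenMP (typically CPU threads)
    // and once for OpenACC (typically GPU offload). The compiler, not the
    // programmer, decides how the work maps to hardware; that is the negotiation.
    #include <cstdio>

    int main() {
        const int n = 1000000;
        static float x[1000000], y[1000000];
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        // OpenMP: request CPU thread parallelism (compile with, e.g., -fopenmp)
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] += 2.0f * x[i];

        // OpenACC: request accelerator offload (compile with, e.g., nvc++ -acc)
        #pragma acc parallel loop copyin(x) copy(y)
        for (int i = 0; i < n; ++i)
            y[i] += 2.0f * x[i];

        std::printf("y[0] = %g\n", y[0]); // prints 6
        return 0;
    }

Note that a compiler without the relevant flag simply ignores the unrecognized pragma and runs the loop serially – the source is unchanged, but the behavior is not, which is part of what makes the negotiation fragile.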
This highlights the problems with trying to specify a single source-level programming environment for all machine architectures. GPUs are great for massive parallelism, but their single instruction, multiple thread (SIMT) model simply doesn’t work well for some problems. CPUs are great because they use a multiple instruction, multiple data (MIMD) execution model and so are not inherently limited in that way, but they are challenged to match the power efficiency and massive parallelism of GPUs.
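The following plain C++ loop illustrates the divergence problem. On a CPU this branchy code is unremarkable; the comments describe, in schematic terms, what would happen if each iteration were a GPU thread executing under SIMT:

    // Branchy code that is harmless under MIMD but costly under SIMT.
    // On a CPU, each core follows its own path independently. On a GPU,
    // threads in the same warp that take different branches execute the
    // branches one after the other, with part of the warp idle each time.
    #include <cstdio>

    int main() {
        const int n = 64;
        float out[64];
        for (int i = 0; i < n; ++i) {    // imagine each i as a GPU thread
            if (i % 2 == 0)
                out[i] = i * 2.0f;       // "even" threads take this path...
            else
                out[i] = i / 2.0f;       // ...while "odd" threads wait, then run this
        }
        std::printf("out[2]=%g out[3]=%g\n", out[2], out[3]); // 4 and 1.5
        return 0;
    }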
This is why hybrid CPU/GPU configurations are now so popular. Meanwhile, novel non-von-Neumann computing architectures that are reconfigurable and support dataflow processing are on the cusp of commercial viability.
However, non-von-Neumann hardware designs provide an interesting insight into the goal of a single environment, because they permit mapping of the compute graph – a data structure the compiler already generates regardless of source language – directly onto hardware. Data simply flows from one computational unit to another through the graph until the desired computational result is realized.
These graphs constitute the “software environment,” one that can leverage all the hardware density and parallelism that modern semiconductor manufacturing can pack on a chip. Performance builds on decades of work by compiler writers to optimize compute graphs: maximizing use of the hardware's compute capabilities and minimizing performance-limiting external memory accesses. Parallelism can be achieved by pipelining work through the compute graph and by instantiating multiple compute graphs to process data in parallel.
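Here is a minimal sketch of that pipelining idea, with threads and queues standing in for hardware stages. The two-stage graph, the Channel type, and the sentinel shutdown are all illustrative choices, not a real dataflow runtime:

    // Pipeline parallelism over a two-stage "compute graph": stage 1 (square)
    // and stage 2 (add one) run concurrently, so successive data items are in
    // flight in different stages at once. Real dataflow hardware does this in
    // silicon rather than with threads and queues.
    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>

    // A tiny thread-safe queue acting as the edge between two graph nodes.
    template <typename T>
    struct Channel {
        std::queue<T> q;
        std::mutex m;
        std::condition_variable cv;
        void push(T v) {
            { std::lock_guard<std::mutex> l(m); q.push(v); }
            cv.notify_one();
        }
        T pop() {
            std::unique_lock<std::mutex> l(m);
            cv.wait(l, [&] { return !q.empty(); });
            T v = q.front(); q.pop(); return v;
        }
    };

    int main() {
        Channel<int> in, mid;
        const int kStop = -1;                    // sentinel to shut the pipeline down

        std::thread stage1([&] {                 // node 1: square
            for (int v = in.pop(); v != kStop; v = in.pop()) mid.push(v * v);
            mid.push(kStop);
        });
        std::thread stage2([&] {                 // node 2: add one, emit result
            for (int v = mid.pop(); v != kStop; v = mid.pop())
                std::printf("result: %d\n", v + 1);
        });

        for (int i = 1; i <= 4; ++i) in.push(i); // stream data into the graph
        in.push(kStop);
        stage1.join(); stage2.join();
        return 0;
    }

While item three is being squared in stage 1, item two can already be in stage 2 – the throughput gain comes from overlap, not from speeding up any single item, which is exactly the appeal of pipelined dataflow hardware.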
LLVM is an existence proof: most compilers in the world already use this framework. MLIR offers a richer graph environment and a unifying theme. Looking to the future, there is huge interest in optimizing deep learning models to run optimally on various target hardware platforms. It is reasonable to think that this optimization work on ANN graphs can be extended to generalized compute graphs. Time will tell. ®
Cast your vote below. We'll close the poll on Thursday night and publish the final result on Friday. You can track the debate's progress throughout the week.