HPC

A unified, agnostic software environment spanning HPC and AI won't be achieved

Readers have their say


Register debate It is entirely human, and in some senses desirable, to eliminate as much complexity from the IT stack as makes sense. And it is also entirely human to accept the need for choice and to acknowledge the necessary frustration of complexity.

There is a tension between these two, as many Reg readers pointed out in our latest debate – Can a unified, agnostic software environment that spans both HPC and AI applications be achieved?

Readers were a little bit more sceptical about this than they were optimistic, and the naysayers edged out the yeasayers, with 53 per cent of the vote against and 47 per cent of the vote for the motion.

The debate opened with Nicole Hemsoth, co-editor of our sister publication, The Next Platform, arguing for the motion and stipulating that it was perhaps time to have the smallest number of tools supporting the widest array of high-performance and high-scale applications at HPC and AI centres.

"Creating a unified HPC and AI software stack that is both open and agnostic sounds like common sense to us, and the reason it sounds like that is because it is," Hemsoth argued.

"What is preventing us from bringing all minds to bear on solving problems instead of endlessly untangling a matrix of tools and code? Egos and near-religious adherence to preferred platforms, the not-invented-here syndrome, and lack of cooperation is the root of this particular evil and the outcome is this vicious cycle reinventing the wheel. Over and over."

After making this logical and hopeful argument for the convergence of HPC and AI development tools and runtimes, Hemsoth conceded that having one stack is probably not going to happen, even if it does make sense in the abstract, and that the best we can hope for is a collection of different stacks that can generate their own native code but also be converted to run on other platforms – much as AMD's ROCm platform can run CUDA code, or emit code that allows applications written for it to run on Nvidia GPU accelerators. Maybe Nvidia will return the favour in kind, and so might Intel with its oneAPI effort?

Then Rob Farber, who has worked at Los Alamos National Laboratory, Lawrence Berkeley National Laboratory, and Pacific Northwest National Laboratory in his long career and is now chief executive officer at TechEnablement, blew our minds a little with an intricate and technical argument. He espoused the idea that a unified, agnostic software environment is an admirable goal, but one difficult to achieve at the source code level because no one – and no single machine architecture, current or yet to be designed – can be left out.

Interestingly, Farber suggested that the key insight is that any unification might happen not at the source code level, but within a compute graph generated inside compilers, such as those based on LLVM: a data structure, independent of the source language, that describes how data flows through and is processed by the hardware.

"These graphs constitute the 'software environment' that can leverage all the hardware density and parallelism that modern semiconductor manufacturing can pack on a chip," Farber explained. "Performance leverages the decades of work by compiler writers to optimize their compute graphs to maximize use of the hardware compute capabilities and minimize performance limiting external memory accesses. Parallelism can be achieved by pipelining work through the compute graph and instantiating multiple compute graphs to process data in parallel."

Dan Olds, chief research officer at Intersect360, argued pretty vehemently against the motion.

"There is no way in hell this will happen," Olds argued. "Why? Because this is a world of human beings who are working in the interests of themselves and their organizations. APIs are sources of competitive advantage for many companies and, as such, not something that those suppliers should want to completely standardize – particularly when that standard is being driven by the largest and most influential supplier in the industry."

We would add that it will be tough to get agreement when there are three major suppliers of compute engines in the data centre – Intel, AMD, and Nvidia – and agree that self-serving standards will not survive in the long run, as Olds pointed out. But the long run can be a very long time. Like decades.

We finished off the debate with me, the other co-editor at The Next Platform, arguing that a single unified HPC and AI development and runtime environment might be less desirable than we might think at first blush.

"In the simulation and modeling and machine learning sectors of the broader high performance computing sector, perhaps one day there will be a unified field, like quantum mechanics and relativity, and perhaps there will be a single programming environment that can span it all," I said.

"But for now, in a post-Moore's Law world where every transistor counts, every bit moved and processed counts, and every joule of energy counts, there is no room for any inefficiency in the hardware and software stack. And that means there is going to be complexity in the programming stack. It is an unavoidable trade-off between application performance and application portability, which we have seen play out over more than five decades of commercial and HPC computing."

History has shown that it is far easier to get standards for the plumbing in the hardware stack – interconnects and protocols and such – than it is to get them higher up in the programming environments. And without knowing what we were all writing, I agreed with my partner at The Next Platform that maybe the best that we could hope for was a level of emulation or conversion like that which is happening between AMD's ROCm and Nvidia's CUDA.

But in the end, as we face a post-Moore's Law world, it keeps getting harder to get more work done within the same thermals and the same budget. (Software is getting more complex faster than hardware can keep up, which is why it costs $500m to build an exascale supercomputer instead of the $50m it took to build a terascale one several decades ago.) Every single piece of code in both the HPC and AI stacks is going to have to be highly tuned to drive efficiency up and thermals and costs down, and that means having a much broader diversity of hardware and, consequently, more compilers, more frameworks, and more libraries.

Readers weigh in

One of the many Anonymous Cowards summed up many of the comments that came in for this debate thus:

"It's a nice idea - write things once and run them anywhere.

Trouble is:

- The vendors need their lock-in. AWS don't want customers to migrate to Google, neither want them to migrate to Azure. There's not many places to go after that.

- You don't want to build your exciting new product with its competitive features at the behest of a competitor, let alone some other self-assigned arbiter. It would have an incredible chilling effect.

It's a nice pipe dream, but won't work in reality.

That's not to say it's impossible in a limited extent - HTML was universal enough to give us the World Wide Web, for example."

And reader ScottTx seconded this idea about vendor lock-in being the real barrier:

"Vendor lock-in. Exactly right. None of the stated reasons why this hasn't happened yet are valid. The single obstacle to this sky-pie is that it won't make anybody any money. There would be no incentive to spend the time and resources to build and maintain such a system."

And that is the real rub as far as all four of us are concerned, and many of you. ®

