Calculating the big picture: HPC will soon see off its von Neumann past
Dear John, I'm leaving you for a robot
High-performance computing (HPC) has a very different dynamic to the mainstream. It enables classes of computation of strategic importance to nation states and their agencies, and so it attracts investment and innovation that is to some extent decoupled from market forces.
Sometimes it leads the mass market, sometimes it builds on it, and with the advent of massive cloud infrastructure, something like a supercomputer can even be built from scratch through a browser, providing you have a high-performance credit card.
But national supercomputer efforts are where innovation in technologies that may have wider applications in future is pushing ahead. The current goal is exascale: computers capable of a sustained 10^18 floating-point operations per second (FLOPS), as measured by standard benchmarks. Everyone's in the race.
South Korea is aiming for that by 2030 using mostly locally designed components while the Japanese national computational laboratory, Riken, currently hosts the world's fastest supercomputer – half-exascale – built around Fujitsu's Arm-based A64FX processors. Europe's main independent supercomputer project will initially use Arm's own V1 chips, but with a move over time to home-designed RISC-V processors. Locally designed RISC-V processors are also at the heart of India's own national supercomputer project.
All these designs are more-of-the-same amplifications of von Neumann architecture, named after John von Neumann who worked on the world's first general-purpose programmable electronic computer, 1945's ENIAC. By dint of zero competition, this was also the world's fastest supercomputer. It established the model of state-funded basic research at the cutting edge with strategic aims – it was used for artillery calculations and to check the feasibility of the hydrogen bomb.
In von Neumann architecture, a central processing unit fetches both code and data from the same memory, and writes data back, over a shared bus, with mass storage and IO attached. To date, HPC has largely grown in power by increasing the speed, throughput, and capacity of each of these components, driven largely by Moore's Law.
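As a rough illustration of that shared-memory model, here is a minimal sketch of a toy machine in Python. The four-instruction set (LOAD, ADD, STORE, HALT) and the memory layout are hypothetical, invented for this example; the point is only that program and data sit in the same memory and are reached through the same fetch path.

```python
# Toy von Neumann machine: code and data live in one memory list.
# The opcodes below are hypothetical, chosen only for illustration.
LOAD, ADD, STORE, HALT = 1, 2, 3, 0


def run(memory):
    """Fetch-decode-execute loop over a single shared memory."""
    pc = 0   # program counter: address of the next instruction
    acc = 0  # accumulator: the CPU's only working register
    while True:
        # Fetch: instructions come from the same memory as data.
        opcode, operand = memory[pc], memory[pc + 1]
        pc += 2
        # Decode and execute.
        if opcode == LOAD:
            acc = memory[operand]
        elif opcode == ADD:
            acc += memory[operand]
        elif opcode == STORE:
            memory[operand] = acc  # write data back to shared memory
        elif opcode == HALT:
            return memory
        else:
            raise ValueError(f"unknown opcode {opcode} at address {pc - 2}")


# Program: memory[10] = memory[8] + memory[9]
memory = [
    LOAD, 8,    # addresses 0-1
    ADD, 9,     # addresses 2-3
    STORE, 10,  # addresses 4-5
    HALT, 0,    # addresses 6-7
    2, 3,       # addresses 8-9: input data
    0,          # address 10: result
]
result = run(memory)
print(result[10])
```

Every step, whether it fetches an instruction or touches data, goes through the same `memory` list, which is the shared path that faster buses and wider fetch engines try to widen.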
While the 2021 Arm V1 uses techniques like very high-bandwidth memory buses, wide fetch and instruction issue engines, and specialist vector engines to achieve performance approximately 100 billion times faster than ENIAC, the two designs share that common basic architecture. That is expected to end.
HPC has a history of using exotic technologies, because investment in fastest-at-any-cost ideas is easier to obtain given its importance to industry, state-funded research, and the computational needs of government agencies, open and covert. This was most apparent in the early days of supercomputing, when specialist semiconductor logic unsuited to the mass market powered best-of-class designs like the Cray 1. As off-the-shelf circuits picked up power, the trend towards seas of standard chips running industry-standard software took over, and is now the standard model.
But with the slow demise of Moore's Law and the development of new classes of computing tasks, new technologies and architectures are once again being researched, and once again supercomputing is predicted to use them before they become suitable for general-purpose use. This is coming about through synergies between emerging devices and a sea change in where supercomputing expects to find its performance gains. One area attracting long-term interest, because it illustrates all these ideas, is hybrid silicon-memristor neuromorphic circuits.
Neuromorphic designs are those based on neural processing systems found in nature. AI and machine learning take some of those ideas, most notably the feed-forward learning neural networks often implemented in software on CPU and GPU hardware, and increasingly in custom accelerator circuits. One of the features of such networks is that the computational requirements of each node in the network can be very light – you don't need lots of CPU per node, though you do want lots of nodes.
Enter the memristor, a circuit component that's been known about for decades, and one whose resistance depends on its history; in some designs that state is changed by altering the magnetic polarisation of a tiny structure within it. It hasn't found much use because it's not very fast; it does remember its state without power, but common silicon logic and memory designs have always been much cheaper and faster. Other attributes of memristors, such as the near-analogue way they can hold a wide range of values, not just 1 or 0, are also hard to exploit in the way we compute today.
But if you use layers of memristor-based components configured as parts of nodes in an on-chip neural network, interconnected by fast silicon, you can achieve very high densities at very low power – and the logic becomes its own memory.
Moreover, the memristor component's wide range of states makes it intrinsically good at holding computational weights, a key aspect of learning networks: a weight stores the importance of a particular signal in making the overall decision.
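To make "the logic becomes its own memory" a little more concrete, here is a minimal sketch of an idealised grid of memristors computing a neural layer's weighted sums. The grid (crossbar) arrangement is an assumption for illustration and is not described above; the function name and the conductance and voltage values are hypothetical, and real devices add noise, nonlinearity, and wire resistance that this ignores. Each stored conductance acts as a weight, each input is a voltage, and each output wire's current is the weighted sum.

```python
def crossbar_currents(conductances, voltages):
    """Output currents of an idealised memristor grid (hypothetical helper).

    conductances[i][j] is the stored state (the weight) of the device
    linking input wire i to output wire j. By Ohm's law each device passes
    a current V_i * G_ij, and the currents on an output wire simply add,
    so I_j = sum over i of V_i * G_ij.
    """
    n_outputs = len(conductances[0])
    return [
        sum(v * row[j] for v, row in zip(voltages, conductances))
        for j in range(n_outputs)
    ]


# Illustrative values only: 3 input wires, 2 output wires.
weights = [
    [1.0, 0.5],
    [0.0, 2.0],
    [0.25, 0.0],
]
inputs = [1.0, 2.0, 4.0]

currents = crossbar_currents(weights, inputs)
print(currents)
```

No weight is ever fetched from a separate memory in this picture: the value doing the multiplication is the state of the device itself.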
The reason research like this is intensely interesting to supercomputer designers is that while classical von Neumann systems are running out of road, AI/ML is doing quite the opposite. In the 10 years or so that AI/ML has been a unified field, rather than a mix of fiefdoms doing vision recognition, natural language processing, and so on, it has started to turn up unexpected results, some with applications in numerical analytics that would not hitherto have seemed a good fit.
Handing over to AI/ML
A notable paper from Edinburgh researchers in 2019 took a classic numeric problem that defied formulaic analysis and could only be approached by massive computation, the three-body problem in chaotic orbital mechanics, and trained an AI on a large dataset of known solutions.
The resultant trained network could then not only solve previously unsolved three-body problems, but in some cases do so 100 million times faster than the standard approach. That's the equivalent of more than two decades of Moore's Law.
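The Moore's Law comparison can be checked with a little arithmetic. The sketch below assumes performance doubles once per fixed period; the three periods tried (12, 18, and 24 months) are assumptions, since no particular figure is given above, and the function name is hypothetical.

```python
import math


def moores_law_years(speedup, doubling_period_years):
    """Years of steady doubling needed to reach a given speedup."""
    return math.log2(speedup) * doubling_period_years


speedup = 100_000_000            # the 100-million-fold figure quoted above
doublings = math.log2(speedup)   # roughly 26.6 doublings

print(f"doublings needed: {doublings:.1f}")
for period in (1.0, 1.5, 2.0):   # assumed doubling periods, in years
    years = moores_law_years(speedup, period)
    print(f"doubling every {period} years -> {years:.1f} years")
```

Even with the most aggressive assumption of a doubling every year, the figure comes out above 26 years, so "more than two decades" is, if anything, conservative.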
There are many examples in many fields where AI/ML techniques act as accelerants for HPC. In fact, one of the most detailed road maps for how the two will mingle over the next decade is AI For Science, a report led by the US Argonne National Laboratory (which was due to have the US's first exascale computer, Aurora, until Intel couldn't make the chips). Despite its name, the report is a detailed investigation into supercomputer development across most fields, bringing together the work of around a thousand researchers.
It predicts increasing hybridisation between classic supercomputers and AI. With increasing hybridisation comes increased use of workflow as a primary model for task creation and management; instead of gathering all the data in one place and then operating on it, cascading operations happen as data flows in from the edge of the system to be used in an overall model.
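The difference between the two models can be sketched in a few lines. This is a toy under stated assumptions, not anything taken from the report: `edge_sensor`, `batch_mean`, and `streaming_mean` are hypothetical names, and a running average stands in for the "overall model".

```python
def edge_sensor():
    """Hypothetical data source at the edge of the system."""
    for reading in (2.0, 4.0, 6.0, 8.0):
        yield reading


def batch_mean(readings):
    """Classic model: gather all the data in one place, then operate."""
    data = list(readings)
    return sum(data) / len(data)


def streaming_mean(readings):
    """Workflow model: update the result as each reading flows in."""
    count = 0
    mean = 0.0
    for value in readings:
        count += 1
        mean += (value - mean) / count  # incremental mean update
        yield mean


batch_result = batch_mean(edge_sensor())
updates = list(streaming_mean(edge_sensor()))
print(batch_result, updates)
```

Both routes arrive at the same final answer, but the streaming version has a usable estimate after every reading and never needs the whole dataset held in one place.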
However, it cautions, architectural changes will be needed on the classical side to fully realise the potential. Current high-performance memory and storage systems expect workloads with relatively small inputs and large outputs, with predictable, contiguous, block-based operations. Current AI training workloads, in contrast, must read large datasets, often petabytes in size, repeatedly and not necessarily in order. AI models will need to be stored and dispatched to inference engines, which could look like small, random operations. This could impact performance considerably if the systems are not designed properly.
And the need to rethink workload management will increase with specialised AI hardware cooperating with traditional systems to train models that are dispatched to low-power devices at the edge.
Further out, in 10 years' time, the Argonne roadmap expects the last generation of supercomputers that are recognisable descendants of previous designs. They will be heavily hybrid, with AI/ML doing much of the system management, model generation, and code generation/optimisation, and with million-qubit quantum computing – perhaps – providing useful assistance. If quantum machines prove useful, they'll be used in simulation, and mostly by the AIs, which will know how to generate optimised code for a task.
Argonne expects 100 exaFLOPS to be achievable by then, with the first zettascale – 1,000 exaFLOPS – designs on the drawing board, though as Argonne director Professor Rick Stevens told the Future of Information and Communication Conference in 2020: "It's not clear how to account for quantum computing and neuromorphic components."
They could come a lot quicker, he said. "And everything that touches data will have some intelligence."
AI will be integrated into storage to provide useful abstractions of data alongside bit-for-bit save and retrieve, and will be able to fill in gaps of missing data with synthetic data, which would not be without its problems.
By then, predicts Stevens, the non-von Neumann parts of supercomputing will be taking the lion's share of developmental effort and power, becoming the primary technologies in the field. The unique needs of the supercomputer, it seems, that brought the von Neumann architecture into being and gave it a life that will easily see its centenary, have also set in motion the causes of its demise, and for the first time in modern computing history we can see the shape of things to come afterwards. ®