Backgrounder Do we need a new processor architecture? Graphcore says that machine learning computation is different from existing computational types, and will be broad enough in its usage for – as well as accelerated significantly by – a dedicated processor architecture.
If UK-based Graphcore is right, then its IPU will join CPU and GPU in a processing architecture triptych. Here we look at what the company is saying to see whether that claim is justified.
Graphcore has built an Intelligent Processing Unit (IPU) to run machine learning applications, such as those for deep neural nets. Graphcore says the idea of a computational graph is used as an abstraction for how machine learning works, with the graph describing a machine learning model.
Why Graphcore IPU?
Graphcore's literature emphasises the differences between standard server processors, such as the x86 Xeon line; Graphics Processing Units (GPUs), such as those from Nvidia; FPGAs, as used by Microsoft in its machine learning initiatives; and its own graph processing-optimised IPU.
It says that the IPU has been optimised to work efficiently on the extremely complex high-dimensional models that machine intelligence requires. It emphasises massively parallel, low-precision floating-point compute and provides much higher compute density than CPUs or GPUs.
CPUs are characterised as scalar processors.
GPUs are high-precision, parallel vector processing devices designed to process data in 2D images or matrices.
FPGAs are arrays of blocks of logic that can be programmed to accelerate specific functions in hardware. However, FPGAs are inflexible, Graphcore claims, as well as difficult to write software for, power-hungry and relatively low in performance.
Google has its Tensor Processing Unit (TPU) ASIC and TensorFlow software library for its machine learning work. The TPU is used to accelerate machine learning in conjunction with CPUs and GPUs.
Graphcore says its IPU offers much higher compute density in machine learning than these other approaches.
CTO Simon Knowles has this view of GPUs: "GPUs are currently much faster than CPUs for deep learning where dense tensors are being manipulated, which suits their wide vector datapaths. In fact, some CPUs like Xeon-Phi have evolved to look more like GPUs."
"But only a subset of machine intelligence is amenable to wide vector machines, and the high arithmetic precision required by graphics is far too wasteful for the probability processing of intelligence. So GPUs are actually very inefficient, but today they are what exists. In the near future we will certainly see more focused alternatives."
The IPU holds the complete machine learning model inside the processor, with computational processes and layers being run repeatedly, and has over 100x more memory bandwidth than other solutions. This results in both lower power consumption and higher performance. Graphcore says this means IPU-using systems train machine learning models much faster and deploy them for inference or prediction work more efficiently than other processor types.
Holding the model inside the processor means that external DRAM is not involved once the IPU has been set executing a model. That implies clock cycles are not used up during processing to move data in and out of the processor, as is the case with CPUs.
A graph process is not producing a two-dimensional chart. It is, very broadly speaking, producing a judgement (probabilistic answer) about something, not a definite and accurate output. To do so, it runs many computational processes (points or vertices) and calculates the effects these vertices, which could have a probability rating, have on other points with which they interact via lines (called "edges" in the trade). Vertices and edges are organised into layers. There is likely to be intensive communication between processes inside a layer and less communication between layers.
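To make the vertices, edges and layers idea concrete, here is a minimal sketch in Python. It is a toy for illustration only, not Graphcore's representation: the vertex names, weights and the sigmoid function are all assumptions, and for simplicity the vertices in a layer read only from earlier layers rather than talking to each other.

```python
import math

# Hypothetical toy graph: every name and number here is illustrative.
# Each layer maps a vertex name to its incoming edges {source vertex: weight}.
LAYERS = [
    {"h1": {"x1": 0.8, "x2": -0.4}, "h2": {"x1": 0.3, "x2": 0.9}},
    {"out": {"h1": 1.2, "h2": -0.7}},
]


def sigmoid(value):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def run_graph(inputs, layers=LAYERS):
    """Evaluate the graph layer by layer and return every vertex value."""
    values = dict(inputs)
    for layer in layers:
        # In this toy, vertices within a layer read only from earlier
        # layers, so a parallel machine could compute them all at once.
        new_values = {
            vertex: sigmoid(sum(values[src] * w for src, w in edges.items()))
            for vertex, edges in layer.items()
        }
        values.update(new_values)
    return values


def count_edges(layers=LAYERS):
    """Count the edges across all layers."""
    return sum(len(edges) for layer in layers for edges in layer.values())


result = run_graph({"x1": 1.0, "x2": 0.0})
print(f"probability-like output: {result['out']:.3f}")
print(f"edges in this toy graph: {count_edges()}")
```

The output is a number between 0 and 1, a judgement rather than a definite answer. A real model has the same shape but with millions of vertices and edges.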
The overall processing works on many vertices simultaneously, and only low precision is needed; Graphcore says small words, holding half-precision floating-point numbers, can be used.
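What half precision costs and saves can be sketched with Python's standard struct module, whose "e" format code stores a 16-bit IEEE 754 float. The sample weight below is an arbitrary assumption, and nothing here reflects how the IPU itself stores numbers.

```python
import struct


def to_half(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # The "e" format code packs a 16-bit float (available since Python 3.6).
    return struct.unpack("<e", struct.pack("<e", value))[0]


def storage_bytes(format_code):
    """Bytes needed to store one number in the given struct format."""
    return struct.calcsize("<" + format_code)


weight = 0.1234567  # arbitrary illustrative value
half = to_half(weight)
print(f"original: {weight!r}")
print(f"as half:  {half!r}")
print(f"relative error: {abs(half - weight) / weight:.2e}")
print(f"bytes per value: half={storage_bytes('e')}, "
      f"single={storage_bytes('f')}, double={storage_bytes('d')}")
```

The half-precision copy keeps roughly three significant decimal digits and takes a quarter of the space of a double, which is the trade being described: less accuracy per number, more numbers moved and processed per cycle.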
Knowles explains how he sees this: "Intelligence, human or machine, involves two essential capabilities. The first is approximate computing – efficiently finding probably good answers where perfect answers are not possible, usually because there is insufficient information, time, or energy. This is what we call judgement in humans.
"The second capability is learning – adapting responses according to experience, which is just previous data. In computer or humans, learning is distilling a probability distribution (a model) from previous data. This model can then be used to predict probable outcomes or to infer probable causes.
"These models are naturally graphs, where the vertices represent the probabilities of features in the data, and the edges represent correlation or causation between features."
Graphcore has Poplar, a graph compiler which takes standard operations used in machine learning frameworks, such as Google's TensorFlow, MXNet, and Caffe, and compiles them into application code for the IPU.
For example, one deep neural network is based on the AlexNet architecture, used for image classification. The Poplar compiler took a description of an AlexNet network and produced a graph with 18.7 million vertices and 115.8 million edges. We are talking big numbers here. The graph is, in effect, a highly parallel execution plan for the IPU.
Graphcore execs think the IPU can increase the speed of general machine learning workloads by 5x and specific ones, such as autonomous vehicle workloads, 50 - 100x.
They assert that GPU machine learning workload performance increases by 1.3 - 1.4x per two-year period, a much slower improvement rate than can be realised with the IPU.
Principal analyst Linley Gwennap of The Linley Group opines: "Machine intelligence and deep learning applications are now popular enough to justify new silicon approaches."
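To put that GPU figure in perspective, the short sketch below compounds it over time. It assumes, purely for illustration, that the claimed 1.3 - 1.4x per two years holds steadily; the function name and the ten-year horizon are our own choices, not Graphcore's.

```python
def compound_growth(factor_per_period, period_years, horizon_years):
    """Total speed-up after horizon_years of steady compounding."""
    return factor_per_period ** (horizon_years / period_years)


for factor in (1.3, 1.4):
    per_year = compound_growth(factor, 2, 1)
    per_decade = compound_growth(factor, 2, 10)
    print(f"{factor}x per two years is about {per_year:.2f}x per year "
          f"and {per_decade:.2f}x per decade")
```

On that assumption, the claimed GPU rate would take most of a decade to deliver the 5x that Graphcore execs say the IPU can bring to general workloads.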
We don't know any details of how the Graphcore IPU works. This is Graphcore's own IP and it isn't revealing any details.
Graphcore CEO Nigel Toon says: "IPUs have a structure which provides efficient massive compute parallelism hand in hand with huge memory bandwidth. These two characteristics are essential to the delivery of a big step-up in graph processing power, which is what we need for machine intelligence. We believe that intelligence is the future of computing, and graph processing is the future of computers."
Knowles provides an insight into how he looks at machine learning today: "There are intelligence tasks (training, inference, or prediction) that would ideally happen on the cellphone or remote sensor but are too compute constrained locally, so currently rely on uploading data to the cloud for processing."
That's what he means when talking about machine learning running at the edge.
There is a three-stage roadmap. Graphcore has started out by building an IPU appliance to be delivered later this year. The appliance will supposedly speed machine learning operations by between 10 and 100 times compared to the best systems in use currently.
It will then develop an IPU accelerator card using a PCIe connection and ultimately hopes to build IPUs for embedding into Internet of Things edge devices, such as autonomous cars, collaborative robots, and language translators, so that such edge devices can run machine learning application code in situ and provide real-time responses to incoming events.
We haven't seen data comparing how a machine learning application runs on a CPU, CPU+GPU, TPU, FPGA, ASIC and IPU. Until we do, we won't know whether the IPU approach is justified in terms of run time and energy efficiency.
Graphcore will almost certainly have some of this information available to it from its own internal testing. We envisage that data being released, assuming it is in Graphcore's favour, when the IPU appliance is announced later this year. After all, to cite the bleedin' obvious, there have to be reasons to buy the thing.
If it does deliver 10-100 times faster execution of machine learning applications, then Graphcore has a future. If it doesn't, it doesn't - if you see what we mean. We hope it does, as that will spark renewed processor developments. ®