Sponsored
Over the last two decades, enterprises have gotten datacenter management down to a fine art. Standardization and automation mean improved efficiency, both in terms of raw compute and power consumption. Technologies such as virtualization and containerization mean users and developers can make more efficient use of resources, to the point of enabling self-service deployment.
However, the general-purpose x86 architectures that fuel modern datacenters are simply not well suited to running AI workloads. AI researchers got round this by repurposing GPU technology to accelerate AI operations, and it is this that has fuelled the breakneck innovation in machine learning over the last decade or so.
However, this presents a problem for enterprises that want to run AI workloads. Typically, the CPU-host-plus-accelerator approach has meant buying a single box that integrates GPUs and x86-based host compute. This can set alarm bells ringing for enterprise infrastructure teams. Although such systems may theoretically be plug and play, they can take up a lot of rack space and may impose different power and cooling requirements from mainstream compute. They can also be inflexible: the ratio of host compute to GPU acceleration is fixed, which limits options when juggling multiple workloads with different host compute requirements.
As researchers and data scientists work with ever larger models and datasets – we’re now safely in the trillion-plus-parameter range – keeping pace by scaling up AI horsepower becomes a question of buying another new system. Then another. And another. But this does little to address any imbalance between host compute and AI acceleration.
And the rest of your highly tuned datacenter? Well, you can’t really leverage it when all the AI workloads are running in a silo. This makes seasoned CIOs and infrastructure managers understandably nervous about the prospect of highly specialized, AI-focused kit morphing into silos of stranded resource before their eyes.
So, while Ola Tørudbakken, SVP systems at Graphcore, says AI, or machine intelligence, will inevitably become the fourth pillar of the datacenter alongside (traditional) compute, networking and storage, he also says this shouldn’t be in the form of black box systems.
Rather, the UK-based AI systems maker argues, the AI firepower needs to be disaggregated from host compute and packaged in a form that doesn’t just fit easily into, but also takes full advantage of, modern datacenters.
This is where the IPU-POD system family comes in. IPU-POD is a family of system configurations based on the IPU-M2000, a 1U blade that delivers 1 petaFLOP of AI compute. Each machine contains four GC200 Mk2 Colossus Intelligence Processing Units (IPUs), each of which features 1,472 independent cores, or tiles. Each tile can run six concurrent threads, and the tiles on each IPU share 900MB of In-Processor Memory on the same silicon.
For a CIO or CTO looking to equip their datacenter with AI compute resources that better suit how that datacenter operates, the starting points are IPU-POD16 and IPU-POD64.
IPU-POD64 is composed of sixteen IPU-M2000s, data and management switches, and a choice of one or four disaggregated host servers. It delivers 16 petaFLOPS of FP16.16 compute and occupies approximately 20RU of rack space, spread across one or more racks depending on the server allocation. IPU-POD16 is a more introductory product for those wanting to explore the technology and begin their innovation journey with it. IPU-POD16 comes in two flavours: a direct-attach, pre-configured plug-and-play system, and an alternative that uses switches. Both variants employ a single host server and deliver 4 petaFLOPS of FP16.16 compute muscle.
Scale-out is baked into each IPU-M2000 via Graphcore’s in-house-designed Gateway chip. This device supports 100Gbps communication in each direction laterally to adjacent IPU-POD racks. As part of the IPU-Fabric interconnect, these connections can be direct or via switches for added flexibility and failover mitigation. The basis for scale-out is Graphcore’s IPU-Fabric, with a total bandwidth of 2.8Tbps. Sixteen IPU-M2000s are scaled up within a rack to create an IPU-POD64. For inter-rack communication, the IPU-Fabric can be configured in a switched 3D torus topology to support thousands of IPUs and up to 1,024 IPU-POD64 racks, creating a massively parallel system delivering 16 exaFLOPS of compute.
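To see why a torus topology suits this kind of inter-rack scale-out, consider the minimal Python sketch below. The dimensions and coordinate addressing are hypothetical, chosen purely for illustration and not drawn from Graphcore’s actual scheme: in a 3D torus, every rack – even one on a “corner” of the grid – has exactly six neighbours, because each axis wraps around.

```python
# Illustrative sketch: neighbour computation in a 3D torus of racks.
# Dimensions and addressing here are hypothetical, chosen only to show
# how wrap-around links give every node exactly six neighbours.

def torus_neighbours(coord, dims):
    """Return the six wrap-around neighbours of `coord` in a 3D torus."""
    x, y, z = coord
    dx, dy, dz = dims
    return [
        ((x + 1) % dx, y, z), ((x - 1) % dx, y, z),
        (x, (y + 1) % dy, z), (x, (y - 1) % dy, z),
        (x, y, (z + 1) % dz), (x, y, (z - 1) % dz),
    ]

# A hypothetical 4 x 4 x 4 arrangement of 64 racks:
dims = (4, 4, 4)
print(torus_neighbours((0, 0, 0), dims))
# Even the corner rack has six neighbours, thanks to wrap-around links.
```

The wrap-around is what keeps the longest path between any two racks short as the system grows, without requiring every rack to be cabled to every other.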
This disaggregated approach plays into the desire of both hyperscalers and enterprises to minimise the number of server SKUs they need to support. The Graphcore systems are effectively “network attached”, says Tørudbakken. Customers simply install Graphcore’s Poplar software – which supports industry-standard machine learning frameworks – on their preferred servers to provide host compute for the IPU-based system. Crucially, the server capacity can be dialled up or down incrementally, according to the AI workload in question. So, while NLP-type workloads typically need relatively little host compute and IO, computer vision workloads normally impose higher loads on the host.
Disaggregation also has benefits in terms of resiliency, says Tørudbakken. “If you have a failure – and servers, they do fail, right – you can take advantage of standard failover protocols provided by a networking stack. So, if a server or a virtual machine fails, then you can migrate that to a standby server or a standby virtual machine.”
The IPU-Fabric is designed to support compiled-in communications at massive scale. When data scientists compile their code using Poplar, the model is mapped across the IPU-POD system as one contiguous IPU resource – essentially as though it were a single giant IPU – right down to the individual cores on each processor. The compiler creates the optimal communications pattern for the machine learning model at hand, whether it runs on one IPU or scales to thousands. The result, Graphcore says, is deterministic, jitter-free communication, even as the system is scaled out to a massive extent.
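The general idea of deciding placement and communications once, at compile time, can be sketched in a few lines of Python. To be clear, this is not Poplar’s algorithm – the cost model, greedy placement and function names below are invented for illustration – but it shows how assigning a model’s layers to a contiguous pool of processors yields a fixed, fully known communication schedule rather than runtime-negotiated traffic:

```python
# Toy illustration of compile-time placement: assign a model's layers to a
# contiguous pool of processors and derive a fixed communication schedule.
# This is NOT Poplar's algorithm -- just a sketch of the general idea that
# placement and comms are decided once, at compile time, not at runtime.

def place_layers(layer_costs, num_ipus):
    """Greedily assign contiguous runs of layers to IPUs, aiming to
    balance the total cost placed on each IPU."""
    total = sum(layer_costs)
    target = total / num_ipus
    placement, ipu, acc = [], 0, 0.0
    for cost in layer_costs:
        if acc >= target and ipu < num_ipus - 1:
            ipu, acc = ipu + 1, 0.0
        placement.append(ipu)
        acc += cost
    return placement

def comms_schedule(placement):
    """Fixed point-to-point transfers between consecutive layers that
    landed on different IPUs -- known entirely at compile time."""
    return [(a, b) for a, b in zip(placement, placement[1:]) if a != b]

layers = [4, 4, 2, 2, 2, 2, 4, 4]       # hypothetical per-layer costs
placement = place_layers(layers, 4)
print(placement)                         # which IPU each layer runs on
print(comms_schedule(placement))         # static inter-IPU transfers
```

Because the schedule is fixed before the job runs, every transfer happens at a predictable point in execution – which is one way to understand the “deterministic and jitter-free” claim.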
But according to Tørudbakken, the user “doesn't really care that much about hardware or systems at all – they basically want a very scalable platform that they can run their code on.”
As for how fast that code runs, Graphcore’s most recent benchmarks showed that training time on BERT-Large can be more than halved using an IPU-POD system compared to a DGX A100 GPU-based system, with over three times higher throughput for inference at the lowest latency. On a Markov Chain Monte Carlo workload, the IPU-M2000 ran a training job in under three hours, compared to 48 hours for the A100 GPU.
Improvements are even more pronounced with more modern models. The IPU-M2000 delivered a ten-fold increase in training throughput on EfficientNet-B4, rising to 18-fold with an IPU-optimized configuration. On inference, the IPU-M2000 delivered a 60-fold increase in throughput and a 16-fold reduction in latency compared to a GPU system. This will no doubt please data scientists, but their infrastructure colleagues might have far more prosaic concerns, starting with questions such as “is the system running optimally and is it likely to fail?” IPU-POD systems are fully supported by a management software suite that provides systems management and observability.
Comprised of industry-standard, open source enterprise tools such as OpenBMC, DMTF Redfish, Prometheus and Grafana, the suite is provided with well-documented, open APIs for management systems integration. “Let's say a power supply is about to fail, then you want to get an alert about that early on,” says Tørudbakken. “So, you can send out a support worker to replace it – or if you have a DIMM that is about to fail, same thing.”
And infrastructure teams will also be very concerned that they are getting the most out of their investment in their AI gear, which means having it support as many workloads, jobs and teams as possible.
AI for anyone?
Although the IPU-POD architecture can be scaled out to tens of thousands of IPUs, creating a tempting canvas for data scientists to run their models on, as Tørudbakken says, the reality is there will be few cases where a single user is taking up all the AI compute capacity: “Typically you will have multiple users and you will have multiple tenants who just use the system.” This means robust resource management is needed. Graphcore’s software stack allows the creation of ‘Virtual IPUs’ and “that allows us to basically carve up the pod into multiple virtual pods, which are fully isolated. An IPU or system in one virtual pod cannot talk to any other IPU in an adjacent pod.”
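The isolation property described here – a pool carved into non-overlapping virtual pods whose members cannot talk across the boundary – can be illustrated with a short Python sketch. The class and method names are hypothetical, invented for this example, and are not Graphcore’s actual API:

```python
# Sketch of the isolation idea behind 'Virtual IPUs': a pool of IPUs is
# carved into non-overlapping virtual pods, and cross-pod communication
# is disallowed. Names are hypothetical -- not Graphcore's API.

class IPUPool:
    def __init__(self, num_ipus):
        self.free = set(range(num_ipus))
        self.pods = {}                    # pod name -> set of IPU ids

    def carve(self, name, size):
        """Allocate `size` free IPUs into an isolated virtual pod."""
        if size > len(self.free):
            raise ValueError("not enough free IPUs")
        ipus = {self.free.pop() for _ in range(size)}
        self.pods[name] = ipus
        return ipus

    def can_communicate(self, ipu_a, ipu_b):
        """IPUs may only talk if they sit in the same virtual pod."""
        return any(ipu_a in p and ipu_b in p for p in self.pods.values())

pool = IPUPool(64)                        # e.g. one IPU-POD64
nlp = pool.carve("team-nlp", 16)
vision = pool.carve("team-vision", 8)
a, b = next(iter(nlp)), next(iter(vision))
print(pool.can_communicate(a, b))         # False: pods are isolated
```

In the real system the enforcement is in the hardware and fabric rather than a Python check, which is what makes the “mine and only mine” assurance stronger than software-only multi-tenancy.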
This is allied with support for industry-standard orchestration tools such as SLURM and Kubernetes, so that virtual machines or containers running on the host can be associated with VPODs in the IPU system. As Tørudbakken says, whether talking at the hyperscale level or about a single rack, while the data scientist is focused on having as much power as possible to run a model, their manager or company will want some assurance that “this resource is mine and only mine.”
This plays into the expectation that many organizations might have their first experience of the IPU via hyperscalers or Graphcore’s own bare metal cloud service, Graphcloud, which launched in January. However, Tørudbakken expects that many will ultimately want to bring machine intelligence on-prem, precisely for reasons of privacy, secrecy and latency. In doing so, they’ll have worked through some very traditional equations around total cost of ownership. One opex question, he says, is “How much power do you actually consume to run, let's say, a BERT job for a year? When you look both at the box level and at the system level, the solution is very attractive.”
From a capex point of view, the equation is even more straightforward. “The only new component that they really need to add is the IPU-M2000-based IPU-POD system. The rest – the networking, and the storage – that is something that many people already have.”
And that is arguably about as undisruptive as you can be when bringing machine intelligence into your datacenter.
Sponsored by Graphcore