How to future-proof your AI on HPC

Why you don’t need specialised hardware makers to benefit from high-performance computing

Sponsored: High-performance computing (HPC) and AI have been converging for several years. AI offers great promise thanks to its ability to deliver new levels of insight into a range of fields, by processing huge volumes of data against specialised models. HPC, thanks to its scale and power, offers the ability to process data at the level demanded by complex, data-intensive AI workloads.

But while you might be keen to embrace AI, even an entry-level HPC system can start at tens of thousands of pounds and ultimately see you spring for proprietary systems in the six-figure range. So, what happens if you don’t have the IT budget to dedicate to the kind of super cluster that’s prescribed as necessary to crunch the data that AI demands?

Fortunately, you don’t need specialised hardware from the world’s supercomputer makers to obtain the benefits of HPC.

Indeed, if you look at the Aurora system being developed for the Argonne National Laboratory that is expected to become the world’s first operational exascale computer, you’ll find it’s being built using Intel® Xeon® Scalable processors along with other components that will be familiar to many an enterprise IT shop already running racks of industry standard servers.

Translated, this means HPC is potentially within the grasp of many. Indeed, many organisations will have already deployed some form of HPC infrastructure in order to operate analytics engines such as Apache Spark to make sense of all that data and transform it into insights that the management can act upon. One application of AI here is to add even greater value to analytics through the use of machine learning/deep learning to predict future events based on historical data.

The first instinct of many IT departments might be to build separate infrastructure for AI processing, but as the example just mentioned demonstrates, the real benefits of AI will come from integrating it into existing workflows, which would be hampered by running simulation, modelling, AI, and analytics workloads on separate server clusters. In addition, specialised infrastructure using hardware accelerators may be costly, and may soon be made redundant by the pace at which AI is evolving at the moment. Intel®, however, has invested heavily in boosting AI performance on its Xeon® Scalable processors, adding new AI instructions that are specifically designed to accelerate operations commonly used in AI workloads.

The second generation Xeon® Scalable processors introduced what Intel® calls Intel® Deep Learning Boost (Intel® DL Boost), comprising new Vector Neural Network Instructions (VNNI), which are specifically designed to accelerate the kind of calculations involved in convolutional neural networks, for example.

VNNI delivers performance improvements by combining three instructions so that they execute as one, and by stepping down from 32-bit floating point numbers to 8-bit integers (INT8), delivering better memory and compute utilisation by allowing more data to be crammed in for inference workloads.
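To make the step down from 32-bit floats to INT8 concrete, here is a minimal sketch in plain Python. It is an illustration only, not Intel's implementation: the symmetric scaling scheme and the function names (`quantize_int8`, `int8_dot`, `quantized_dot`) are assumptions made for this example. It shows the general idea of mapping floats onto small integers, multiplying and accumulating in integer arithmetic, then rescaling the result.

```python
def quantize_int8(values):
    """Symmetric linear quantisation of floats onto integers in [-127, 127].

    Hypothetical helper for illustration; real frameworks use calibrated
    and often per-channel scales.
    """
    max_abs = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def int8_dot(a_q, b_q):
    """Integer multiply-accumulate; hardware keeps the sum in a wider register."""
    return sum(x * y for x, y in zip(a_q, b_q))


def quantized_dot(a, b):
    """Approximate a floating-point dot product using INT8 operands."""
    a_q, a_scale = quantize_int8(a)
    b_q, b_scale = quantize_int8(b)
    return int8_dot(a_q, b_q) * a_scale * b_scale


weights = [0.12, -0.53, 0.98, 0.07]
inputs = [1.5, -2.0, 0.25, 3.0]

exact = sum(w * x for w, x in zip(weights, inputs))
approx = quantized_dot(weights, inputs)
print(f"float result: {exact:.4f}, INT8 result: {approx:.4f}")
```

Each operand here occupies a quarter of the space of a 32-bit float, which is where the memory and throughput gains come from. The price is a small approximation error, which is generally easier to tolerate in inference than in training.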

The next generation of Xeon® Scalable processors is set to bring further enhancements, with new instructions supporting the bfloat16 number format. Otherwise known as “brain floating-point format”, this was developed by Google and is attractive for use in deep learning because the range of values it can represent is the same as that of a 32-bit floating-point number, but as it is half the size, twice as much data can be crammed in, thereby providing twice the number crunching throughput per clock cycle.
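The relationship between bfloat16 and a 32-bit float can be sketched in a few lines of Python. A bfloat16 value keeps the sign bit and all eight exponent bits of the 32-bit format, and only the top seven mantissa bits. The sketch below simply truncates the lower 16 bits, which is an assumption made for brevity (hardware typically rounds instead), and the function names are hypothetical.

```python
import struct


def float32_bits(x):
    """Return the 32-bit IEEE 754 encoding of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def to_bfloat16_bits(x):
    """Keep the top 16 bits of the float32 encoding (truncation, no rounding)."""
    return float32_bits(x) >> 16


def from_bfloat16_bits(bits):
    """Expand 16 bfloat16 bits back to a float by zero-filling the low bits."""
    return struct.unpack(">f", struct.pack(">I", (bits & 0xFFFF) << 16))[0]


def bfloat16_round_trip(x):
    """Show what value survives after storing x as bfloat16."""
    return from_bfloat16_bits(to_bfloat16_bits(x))


for value in (1.0, 3.14159, 3.0e38):
    print(value, "->", bfloat16_round_trip(value))
```

Very large values such as 3.0e38 survive the round trip because the exponent is untouched. What is lost is precision, as only two to three significant decimal digits remain.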

In other words, Intel®’s Xeon® Scalable processors are ramping up their performance in AI processing through successive hardware accelerator features, coupled with higher core counts and software optimisation.

This means that existing HPC clusters built around general-purpose industry standard hardware represent a good starting point for flexible, cost-effective performance across a range of different workloads.

The right tools

It’s good to get an understanding of this now. Many enterprises are only just starting out on their AI journey. According to a 2019 report from KPMG [PDF], just 17 per cent of companies reported use of AI or machine learning at scale. Half of those, however, stated that they expect to be using AI/machine learning at scale within three years.

When it comes to architecting and deploying HPC systems ready for AI, these organisations will need to think beyond purely run-time hardware and consider the need for specialised software that can be used to build and train the AI models.

Commonly used frameworks for machine learning include Google’s TensorFlow library, PyTorch, Caffe, and Apache MXNet. Crucially, many of these (and others) are available in versions that have been optimised to improve processing performance when run on Intel® Xeon® Scalable server systems.

For example, optimisations to TensorFlow were the result of a close collaboration between Intel® and Google, and have been shown to deliver results orders of magnitude faster than non-optimised code: up to 70 times higher performance for training and 85 times higher for inferencing.

Getting to this result involved refactoring the code to take advantage of vector processing instructions in modern CPUs, such as AVX-512 in Xeon® Scalable processors; paying special attention to all the available cores to maximise parallel execution; and careful use of prefetching and caching data.

These changes rely heavily on the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN), which provides a set of deep learning primitives optimised for Xeon® Scalable processors. Many other frameworks, such as those above, are available in optimised versions that use Intel® MKL-DNN.

Intel® has already gone down this route for HPC with the similar sounding Intel® Math Kernel Library (Intel® MKL), which provides optimised code for functions such as fast Fourier transforms or statistical functions. This is used to power BigDL, a distributed deep learning library for Spark that enables AI applications to be built directly on top of existing Spark or Apache Hadoop clusters.

Intel® has also developed AI toolkits of its own, such as OpenVINO (Open Visual Inference and Neural Network Optimisation). This is aimed at developers building deep learning visual processing applications. It comprises a Model Optimiser that can take a pre-trained model output from the likes of Caffe or TensorFlow, plus an Inference Engine to execute the model on Intel® architecture systems.

Lastly, Intel® is developing a unified programming model aimed at enabling data-centric workloads. Currently in beta, oneAPI is intended to abstract away the hardware and offer a single set of APIs for various HPC and AI functions.

Don’t forget data access

CPU performance is critical for HPC AI workloads but it’s important not to overlook storage and memory subsystems too, since you will need to keep the processor fed with data as fast as it can take it, or performance stalls.

On the memory side, large capacity is beneficial when working with large datasets, as is typical for HPC workloads or training machine learning models. DRAM, however, is still costly, and filling the memory slots in servers with the largest capacity DIMMs available would be prohibitively expensive for most organisations.

One way of getting around this is to use a combination of DRAM and Intel® Optane™ DC Persistent Memory modules, a type of non-volatile memory that can be accessed like DRAM and comes in DIMMs with higher capacities at a lower cost per gigabyte. While Optane™ DIMMs are slightly slower than DRAM, the ability to cram much more data into memory speeds up data-intensive workloads that are I/O bound, as the number of fetches from disk is reduced.
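The trade-off can be illustrated with a simple weighted-average model of access time. The sketch below is not a benchmark: the latency figures and hit fractions are placeholder assumptions chosen only to show the shape of the argument, and `average_access_ns` is a hypothetical helper.

```python
def average_access_ns(hit_fraction, memory_ns, storage_ns):
    """Expected time per access when hit_fraction of accesses are served from memory."""
    if not 0.0 <= hit_fraction <= 1.0:
        raise ValueError("hit_fraction must be between 0 and 1")
    return hit_fraction * memory_ns + (1.0 - hit_fraction) * storage_ns


# Illustrative placeholder figures, not measurements of any product.
FAST_MEMORY_NS = 100.0      # smaller, faster memory tier
SLOWER_MEMORY_NS = 400.0    # larger, somewhat slower memory tier
STORAGE_NS = 100_000.0      # fetch from disk or SSD

# Small fast memory: only half of the working set fits.
small_fast = average_access_ns(0.50, FAST_MEMORY_NS, STORAGE_NS)

# Larger, slower memory: most of the working set fits.
large_slower = average_access_ns(0.95, SLOWER_MEMORY_NS, STORAGE_NS)

print(f"small fast memory:    {small_fast:,.0f} ns per access")
print(f"large slower memory:  {large_slower:,.0f} ns per access")
```

Under these assumed numbers, the cost of going to storage dominates, so holding more of the dataset in a slower memory tier still comes out well ahead. Real results depend on the workload's access pattern and the actual latencies involved.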

Storage is critical, and HPC sites have been moving towards employing a very fast flash layer, back-ended by less speedy flash or even disk drives that are less costly and provide a very large capacity tier.

For enterprises, SSDs are already the way forward, especially with the advent of the NVMe protocol that allows drives to attach directly to the PCIe bus and cuts through all of the cruft of the legacy storage stack to deliver low latency and high throughput.

To get the performance required for AI, though, organisations should consider using SSDs with the lowest possible latency and very high throughput, especially for servers that train models. A good choice would be Intel® Optane™ DC SSDs, which also boast high quality of service (QoS) and high endurance characteristics, in place of standard NAND-based SSDs.

Intel® also has plans to refresh its Optane™ technology. The next generation of Optane™ DC SSDs, codenamed Alder Stream, are in the pipeline, along with future Optane™ DC Persistent Memory modules codenamed Barlow Pass.

The Alder Stream Optane™ DC SSDs are expected to perform about 50 per cent better than the current generation, while the next generation Optane™ DC Persistent Memory modules are set to provide higher capacity, perhaps 1TB per DIMM, offering an upgrade path to greater performance.

Looking to the future, Cray’s Aurora supercomputer is set to be powered by a forthcoming Xeon® Scalable processor family, which is expected to come with further Intel® DL Boost capabilities to further accelerate its AI processing, with Intel® hinting that this may offer a significant improvement in performance over the current generation.

What does this mean for the everyday enterprise? It means that HPC isn’t something for the super leagues of IT. Intel® has added, and seems likely to continue adding, features and capabilities to its Xeon® Scalable platform that will ratchet up AI performance with successive generations. For those getting ready to deploy AI, investment in Intel®-based HPC infrastructure will provide a sound foundation for their future.

Sponsored by Intel®
