Why storage is key to data-intensive workloads

Traditional architectures = performance bottlenecks


Sponsored It wasn’t long ago, perhaps only a few years, that the most data-intensive workloads, and the infrastructure required to run them effectively, weren’t even on the radar for the average enterprise. That has changed swiftly as the volumes of data retained by organisations have rocketed, and the tools needed to make sense of all that data, such as advanced analytics and AI, have a lot in common with HPC workloads.

High performance computing (HPC) is all about using compute power to solve the most complex and challenging problems in business, engineering and scientific research. It is a broad area taking in the most powerful supercomputers in the world at the high end, but for the most part involves systems that will be familiar to anyone involved with enterprise IT, though typically configured at a much larger scale.

The defining characteristic of these emergent AI and analytics applications is that they involve very large data sets, often too large for a single computer to easily crunch through them. You could build a bigger and more powerful computer, but a more realistic solution is to connect multiple computers into a cluster and have them work on the problem in parallel. Additionally, the widespread adoption of GPU-based compute systems has further parallelized data analysis to gain additional acceleration.

This is the way that even the largest supercomputers are constructed today, made up of hundreds or even thousands of compute nodes connected together using high speed networking. Each node is essentially a server, similar to the servers in a corporate data centre, with many supercomputers using nodes built around processors from the Intel Xeon or NVIDIA Tesla families, for example.

Even with this amount of processing power, the datasets involved are often so large that they will not fit into memory all in one go. In many applications, the compute nodes have to be continuously fed data from the storage subsystem, with the results being periodically written back to storage.
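The feed-and-write-back pattern described above can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular HPC application works: the function names `read_chunks` and `process_stream` are hypothetical, in-memory buffers stand in for the storage subsystem, and a simple byte sum stands in for the real computation.

```python
import io
from typing import BinaryIO, Iterator


def read_chunks(source: BinaryIO, chunk_size: int) -> Iterator[bytes]:
    """Yield successive chunks so only one chunk is held in memory at a time."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        yield chunk


def process_stream(source: BinaryIO, sink: BinaryIO, chunk_size: int) -> int:
    """Feed chunks to a stand-in computation and write results back as we go.

    Returns the running total (a byte sum) over the whole stream.
    """
    total = 0
    for index, chunk in enumerate(read_chunks(source, chunk_size)):
        total += sum(chunk)  # stand-in for the real computation
        sink.write(f"{index},{total}\n".encode())  # periodic write-back
    return total


# In-memory buffers stand in for the storage subsystem (an assumption).
dataset = io.BytesIO(bytes(range(10)))
results = io.BytesIO()
grand_total = process_stream(dataset, results, chunk_size=4)
```

Because the compute loop can only run as fast as `read_chunks` delivers data, the speed of the underlying storage directly bounds how busy the processors can be kept.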

This means that for data-intensive workloads, the storage subsystem plays a vital part in delivering the required performance level, even more so than is the case with other applications, such as traditional enterprise workloads.

Why new workloads need different storage

Traditionally, enterprise storage has been delivered through shared network storage such as a file server or a specialised network attached storage (NAS) appliance. More mission-critical workloads, such as customer relationship management (CRM) or even email, typically rely on a database for storage, and that database will be running on a cluster of servers backed by a storage area network (SAN).

When it comes to unstructured data, a file system is the better choice because of the need to search data easily. Organisations might be tempted to rely on existing file server or NAS systems, but while these are perfectly adequate for traditional enterprise workloads, they may not be up to the job when it comes to the scale required for processing large data sets.

As the name implies, a NAS at its simplest is largely a box filled with drives and network connectivity so the storage capacity of those drives can be shared. The drawback of this is that the controller inside the NAS, which manages the drives and presents a file system to the network, effectively becomes a single point of failure for the entire infrastructure.

This arrangement also limits performance and scale because the controller inside the NAS is a bottleneck through which all read and write requests have to pass, while there is a limit to the number of drives that can be fitted into the NAS enclosure. Some enterprise NAS platforms address these restrictions by using a cluster of nodes that present as a single system, but there are still limits on scalability.

For this reason, data-intensive workloads typically call for a parallel file system, one that spreads the data across a large number of storage nodes, ideally so that each compute node can directly communicate with each storage node. This allows multiple reads and writes to happen at the same time. The more storage nodes there are, the greater the throughput of the entire storage subsystem, where throughput is a measure of how much data is read or written each second.
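As a rough sketch of the idea, the Python below spreads a file across storage nodes in fixed-size stripes, round-robin, and models aggregate throughput as scaling linearly with node count. The function names, the stripe size and the linear-scaling model are all illustrative assumptions; real parallel file systems such as Lustre use their own layout logic, and real throughput is also limited by the network and by contention.

```python
from typing import Dict, List


def stripe(data: bytes, num_nodes: int, stripe_size: int) -> Dict[int, List[bytes]]:
    """Distribute fixed-size stripes of data across storage nodes round-robin."""
    placement: Dict[int, List[bytes]] = {node: [] for node in range(num_nodes)}
    for offset in range(0, len(data), stripe_size):
        node = (offset // stripe_size) % num_nodes
        placement[node].append(data[offset:offset + stripe_size])
    return placement


def reassemble(placement: Dict[int, List[bytes]]) -> bytes:
    """Rebuild the original data by reading stripes back in round-robin order."""
    num_nodes = len(placement)
    total = sum(len(pieces) for pieces in placement.values())
    return b"".join(
        placement[i % num_nodes][i // num_nodes] for i in range(total)
    )


def aggregate_throughput(num_nodes: int, per_node_gb_per_s: float) -> float:
    """Idealised model: total throughput grows linearly with node count."""
    return num_nodes * per_node_gb_per_s
```

Since consecutive stripes live on different nodes, a client reading a large file can fetch from several nodes at once instead of queuing behind a single controller.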

Another difference is in the mix of drives used. Enterprise storage has shifted towards storage arrays filled with flash drives - solid state drives (SSDs) - because of the low latency offered by flash memory. This translates to faster reads and writes for applications such as databases, improving application performance.

Data-intensive workloads also benefit from the low latency of flash, but the size of the datasets involved means that it is often too costly to use all-flash storage. A typical arrangement is therefore a mix of hard drives for large storage capacity and a smaller amount of flash storage that acts as a buffer, serving up read data for fast access and absorbing writes to be committed to the disk layer later.
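The buffering behaviour described above can be illustrated with a toy two-tier store in Python. The class `HybridStore`, its least-recently-used eviction policy and its explicit `flush` method are assumptions made for this sketch, not a description of any vendor's implementation.

```python
from collections import OrderedDict
from typing import Dict, Optional, Set


class HybridStore:
    """Toy two-tier store: a small flash buffer in front of a large disk tier."""

    def __init__(self, flash_capacity: int) -> None:
        self.flash_capacity = flash_capacity
        self.flash: "OrderedDict[str, bytes]" = OrderedDict()  # oldest first
        self.dirty: Set[str] = set()  # keys in flash not yet written to disk
        self.disk: Dict[str, bytes] = {}

    def write(self, key: str, value: bytes) -> None:
        """Absorb the write in flash; it reaches disk on flush or eviction."""
        self.flash[key] = value
        self.flash.move_to_end(key)
        self.dirty.add(key)
        self._evict_if_full()

    def read(self, key: str) -> Optional[bytes]:
        """Serve from flash when possible, else fetch from disk and cache it."""
        if key in self.flash:
            self.flash.move_to_end(key)
            return self.flash[key]
        value = self.disk.get(key)
        if value is not None:
            self.flash[key] = value
            self._evict_if_full()
        return value

    def flush(self) -> None:
        """Commit every buffered write to the disk tier."""
        for key in self.dirty:
            self.disk[key] = self.flash[key]
        self.dirty.clear()

    def _evict_if_full(self) -> None:
        """Drop least recently used entries, writing dirty ones to disk first."""
        while len(self.flash) > self.flash_capacity:
            key, value = self.flash.popitem(last=False)
            if key in self.dirty:
                self.disk[key] = value
                self.dirty.discard(key)
```

Writes return as soon as they land in the flash tier, and the slower disk tier only sees them when the buffer fills or is flushed, which is what lets a small amount of flash smooth out bursts of I/O.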

Complexity can lead to increased costs

Managing all of this storage complexity can be a challenge in many environments, especially as the access pattern of reads and writes will differ between various data-intensive workloads, meaning that the infrastructure may need to be tweaked to deliver the best performance for each application that an organisation runs.

This can be seen in a study conducted by the analyst firm Hyperion Research, which found that the greatest challenges of operating storage infrastructure for demanding workloads included recruiting and training storage experts with the right skills, plus the time and cost taken up by tuning and optimisation.

More than three-quarters of the organisations surveyed by Hyperion reported episodes in the past year when storage issues had reduced productivity, with one in eight sites reporting more than 10 incidents over the previous year.

This level of unreliability can be costly: respondents reported that a storage system failure typically cost their organisation more than $100,000 in lost revenue, even in cases where recovery took just a single day. This highlights the need for effective monitoring and management tools that allow admin staff to see what is happening with their storage infrastructure and deal with any developing problems before they lead to downtime.

DDN At-Scale Storage

As a company with a long pedigree in working with data-intensive workloads, DDN has built a portfolio of products for customers across a wide variety of industries, including financial services companies, manufacturing, academic research facilities, energy companies, life sciences, and healthcare.

DDN’s EXAScaler is a family of file system appliances based on the widely used Lustre parallel file system, for which DDN has been the primary developer and maintainer since 2018. EXAScaler is designed for high performance and scalability, and is available in all-flash models and in hybrid models that combine SSDs and disk drives.

DDN A³I is a line of products aimed at delivering the storage performance needed for AI and deep learning workloads. Designed for enterprise deployment with simple, scalable building blocks, A³I makes scaling performance or capacity straightforward. Through well tested and documented reference architectures, it also combines DDN storage and data management with NVIDIA GPU-based systems such as the DGX A100, which is built around NVIDIA’s Tensor Core GPU.

One customer that has chosen to deploy DDN storage is Recursion Pharmaceuticals, a startup based in Salt Lake City using AI and machine learning processes to accelerate drug discovery. It needed optimised storage infrastructure to accelerate its AI applications and eliminate bottlenecks for critical workloads.

In collaboration with DDN’s engineers, the company built a proof-of-concept using EXAScaler on ES400NV and ES7990X storage appliances with 2PB of capacity, plus an all-flash layer deployed as a front-end to the file system.

The result is robust storage that seamlessly supports 18 compute nodes and 136 GPU accelerators for the AI processing, with the flash layer delivering a 40 per cent reduction in file access time and enabling all the GPUs to get up to 100 per cent utilisation.

DDN’s experience of building scalable data management and storage platforms for very demanding workloads means that the firm is well positioned to assist businesses that are now getting to grips with the challenges of integrating data-intensive techniques such as machine learning and advanced analytics into their enterprise workflows as they try to keep ahead of the competition.

Find out more about DDN products and use cases at this year’s NVIDIA GTC virtual conference, running April 12-16. Visit the GTC page to get updates on registration and programming as they become available.

Sponsored by DDN

