Artificial intelligence (AI) started out as laboratory research, but today's AI techniques such as machine learning and deep learning are increasingly finding their way into real-world applications: detecting fraudulent activity in financial transactions, analysing retail data to deliver a personalised shopping experience, or finding the optimal route for delivery vehicles.
These trends mean that AI is rapidly becoming an integral part of many enterprise workflows, from email to CRM or ERP, and especially data analytics to glean business intelligence insights from an organisation’s own datasets on how to drive efficiencies or even create new business opportunities.
Not surprisingly, a recent IDC survey showed that 85 per cent of organisations are evaluating AI or already using it in production. However, IDC also found that most organisations reported failures among their AI projects, with a quarter reporting a failure rate of up to 50 per cent. These failures have been attributed to a lack of skilled staff or to cultural challenges, but another common cause is the inability to scale from a proof-of-concept project to a production deployment that can serve the entire organisation.
Taking a step back, it is important to distinguish between organisations that are investing in AI to gain a significant strategic advantage, or perhaps reinvent their entire business model, and those that are simply looking to streamline operations or automate certain tasks. For the latter, an ecosystem of cloud-based AI services is already springing up that can be accessed via APIs and integrated into business workflows.
The other kind of company is exemplified by a bank spun out from the ecommerce giant Alibaba, which uses AI to run all of its financial services, allowing it to move much faster than rivals when it comes to processes such as loan approvals, and with a fraction of the number of employees. As detailed in the Harvard Business Review, the core of this new bank is a ‘decision factory’ based on AI that treats decision making as a science, using data to drive predictions and insights that guide and automate the company’s operational workflows.
For this kind of business, implementing your AI strategy starts with having the right data as well as understanding how to use it. This means employing not just data scientists but also data strategists, who are professionals with the ability to translate business problems into analytical solutions and insights.
AI is built on data
Data, vast amounts of it, is the ultimate raw material for developing machine learning (ML) or deep learning (DL) models. The more sample data you can use to train a model, the more accurate and reliable its output will become. The upshot is that storing and processing data for AI projects often calls for hardware that has more in common with high performance computing (HPC) installations than with traditional enterprise IT environments.
Training a deep learning model or analysing vast quantities of data calls for a substantial amount of processing power. This can be provided by a cluster of servers with high-end processors crunching away at the problem in parallel, but a more effective solution is to turn to specialised accelerators such as the GPU, or Graphics Processing Unit.
GPUs get their name because they started out as accelerators for 3D graphics, where millions of repetitive calculations are needed to render images. For this reason they have a massively parallel architecture with hundreds or thousands of simple processing cores, which turns out to suit the calculations involved in AI models as well.
For example, Google found that an image-recognition system that had required 16,000 CPUs could be handled by just 48 Nvidia GPUs.
This does not mean the GPU entirely displaces the CPU in infrastructure designed for AI workloads. CPUs are still needed to handle the application logic and other data science calculations, so compute nodes combining CPUs and GPUs will prove to be the optimal solution in most cases.
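The division of labour described above, CPU-side orchestration feeding massively parallel arithmetic, can be sketched in plain Python. This is purely an analogy, not GPU code: the thread pool stands in for an accelerator's many cores, and all the function names are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_dot(chunk):
    # The "accelerator" work: simple, repetitive arithmetic on one chunk
    xs, ys = chunk
    return sum(x * y for x, y in zip(xs, ys))

def parallel_dot(xs, ys, workers=4):
    # CPU-side "application logic": partition the data into chunks...
    n = len(xs)
    step = max(1, (n + workers - 1) // workers)
    chunks = [(xs[i:i + step], ys[i:i + step]) for i in range(0, n, step)]
    # ...then farm the chunks out to run in parallel, standing in for
    # the hundreds of simple cores a GPU would bring to bear
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_dot, chunks))
```

The key point is that the orchestration (splitting, scheduling, combining) remains general-purpose CPU work, while the repetitive per-element maths is what accelerators excel at.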
Storage feeds compute
As in traditional HPC architectures, the key to optimal performance is keeping the compute nodes and their GPUs fed with data at a high enough rate to keep them busy, which means the storage infrastructure plays a vital part in delivering the required performance. The right storage system must deliver high throughput, to prevent the costly GPUs from standing idle, but it must also be flexible and scalable.
Complicating matters, different AI workloads display different access patterns in the way they read and write data, and the storage layer needs to cope with all of them. ML training workloads, for example, tend to follow an unpredictable access pattern, generating large numbers of reads and writes that mix random and sequential accesses of varying sizes, and the storage layer must be able to absorb these while still delivering high throughput.
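From the application's point of view, the difference between these patterns is simply the order and placement of the offsets being read. A minimal stdlib sketch, using an assumed 4KB block size and a throwaway temporary file as the "dataset", makes the contrast concrete:

```python
import os
import random
import tempfile

BLOCK = 4096  # assumed fixed I/O size; real workloads mix many sizes

def read_blocks(path, offsets, size=BLOCK):
    """Read fixed-size blocks at the given offsets, in the order given."""
    out = []
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            out.append(f.read(size))
    return out

# Build a small stand-in "dataset" of 16 blocks
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(BLOCK * 16))
    path = tmp.name

sequential = [i * BLOCK for i in range(16)]            # streaming scan
shuffled = random.sample(sequential, len(sequential))  # random access

seq_data = read_blocks(path, sequential)
rnd_data = read_blocks(path, shuffled)
os.unlink(path)
```

Both calls retrieve exactly the same bytes, but on real hardware the shuffled offsets defeat read-ahead and caching, which is why a storage layer tuned only for streaming throughput can struggle with training workloads.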
When the training dataset is small enough, such as in a pilot deployment, it might be cached in local memory or served up from local flash drives (SSDs) in a small cluster of compute nodes, and this can deliver an adequate level of performance, especially if the flash SSDs are NVMe drives.
NVMe is a storage standard that uses the high-speed PCIe bus to link SSDs directly to a system's processor, instead of a legacy interface such as SAS or SATA. It also specifies a new, efficient protocol that cuts software overheads and therefore preserves the low latency that flash storage offers. A key feature of NVMe is support for multiple I/O queues, up to 65,535, enabling flash storage to service many requests in parallel. This exploits the internal parallelism of NAND storage devices and allows much higher raw throughput than SAS or SATA.
However, scaling such a pilot deployment to support the volumes of data needed for production AI use cases can be difficult or costly, and this is a likely reason why some AI projects fail to move beyond the proof-of-concept stage.
Cost also plays a part. Many all-flash storage architectures rely on a separate object storage pool, or similar, to hold less frequently accessed cold data. In contrast, storage firm DDN offers a feature called Hot Pools that lets users keep everything in one file system by automatically migrating data between a flash tier for hot data and a larger spinning-disk tier for cooler data. This cuts management overheads and cost while keeping all data close to hand.
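The general idea of access-based tiering can be sketched in a few lines. To be clear, this is a generic illustration, not DDN's Hot Pools algorithm: the one-week threshold, the tier names, and the function are all assumptions made for the example.

```python
import time

# Assumed policy: data untouched for a week migrates to the disk tier
HOT_WINDOW_S = 7 * 24 * 3600

def plan_migration(files, now=None, hot_window=HOT_WINDOW_S):
    """Given {name: last_access_time}, decide which tier each file
    belongs on. Recently accessed data stays on the flash tier; the
    rest is marked for migration to spinning disk."""
    now = time.time() if now is None else now
    return {
        name: "flash" if now - atime < hot_window else "disk"
        for name, atime in files.items()
    }
```

The appeal of doing this inside a single file system, rather than across separate flash and object pools, is that applications keep one namespace and one path to every file regardless of which tier currently holds it.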
Accelerated, Any-Scale AI
A good example of this can be seen in the Accelerated, Any-Scale AI (A³I) portfolio from DDN, a company which has specialised in high performance storage for over two decades. The A³I line-up is a set of pre-configured appliances based on DDN’s EXAScaler system, with a choice of all-flash NVMe SSDs or a hybrid mix of flash for speed and hard drive storage for high storage capacity.
To scale, customers can simply add additional appliances, with up to 256TB of all-flash NVMe capacity per AI200X/AI400X appliance, or 4PB of hybrid storage in the AI7990X model. Each can be regarded as a building block which can be aggregated into a single filesystem that can scale in capacity, performance and capability.
According to DDN, the A³I appliances are optimised for all types of access patterns and data layouts, in order to ensure full GPU resource utilisation. Each appliance also has multiple high-speed host interfaces, with up to eight ports of either HDR100 InfiniBand or 100Gbit/s Ethernet.
Certified for AI infrastructure
In recognition of this, leading GPU supplier Nvidia has included DDN A³I storage in reference architectures alongside its DGX A100 system, a dedicated AI compute platform that combines eight of its latest A100 Tensor Core GPUs with a pair of AMD Epyc CPUs. Styled as a universal system for all AI workloads, the DGX A100 can ingest data at an impressive rate, up to 192GB/s, and four DDN AI400X storage appliances working in parallel are capable of keeping all those GPUs fully saturated with data.
The DGX A100 is fairly new, but customers have already been using DDN storage with Nvidia’s older DGX-1 platform in AI applications. Japan’s Tohoku University Medical Megabank Organisation (ToMMo) has implemented DDN EXAScaler storage connected to a DGX-1 GPU-based analysis server running the Parabricks genomic analysis software as part of its medical supercomputer system.
According to the university, this has greatly boosted its analytical capacity and sample sizes. By being able to deal with a much larger dataset, methods that previously only existed in theory have now become viable for use, enhancing the accuracy of its data analyses.
The lesson is that to transform business operations using AI, organisations need to be able to deal with massive amounts of data. This in turn means building an infrastructure capable of handling that volume of data, as well as having a way to scale access to data and compute resources without breaking the bank, in order to support future growth.
Companies seeking to push ahead of their rivals through the adoption of a comprehensive data strategy need the assurance that they are not taking on additional risks with their infrastructure. Choosing a storage supplier with a pedigree in serving the most demanding data intensive environments with a portfolio of solutions is the sensible place to start.
Sponsored by DDN