Why GPU-powered AI needs fast storage
Don’t let I/O bottlenecks impede the training of your AI models
Advertorial The process of training modern AI models can be incredibly resource-intensive – a single training run for a large language model can require weeks or months of high-performance compute, storage, and networking, even with the parallel processing capabilities of graphics processing units (GPUs).
As a result, many organizations are expanding their compute, storage, and networking infrastructure to keep up with AI-driven demand.
But there's a problem. AI training workloads operate on massive data sets, so it's crucial that the storage system can transfer data fast enough to prevent the GPUs from being starved of data.
IBM Storage Scale System 6000 has been engineered to address these performance-intensive requirements. It helps speed up data transfer by using the NVIDIA GPUDirect Storage protocol to set up a direct connection between GPU memory and local or remote NVMe or NVMe-oF storage components, removing the host server CPU and DRAM from the data path.
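To make that data path concrete, here is a minimal sketch in Python using NVIDIA's open source kvikio library, which wraps the GPUDirect Storage (cuFile) API. The file path and buffer size are purely illustrative, and nothing below is specific to IBM's implementation; it simply shows what reading a file straight into GPU memory looks like:

import cupy    # GPU array library; allocations live in GPU memory
import kvikio  # NVIDIA's Python wrapper around the cuFile / GPUDirect Storage API

# Allocate a destination buffer in GPU memory (size is illustrative).
gpu_buffer = cupy.empty(256 * 1024 * 1024, dtype=cupy.uint8)

# Open the file and read it directly into the GPU buffer. When GPUDirect
# Storage is available, the transfer moves from NVMe (or NVMe-oF) to GPU
# memory without being staged in host DRAM; otherwise kvikio falls back
# to a bounce buffer in host memory.
f = kvikio.CuFile("/mnt/scale/train_shard_000.bin", "r")  # hypothetical path
bytes_read = f.read(gpu_buffer)
f.close()

print(f"Read {bytes_read} bytes directly into GPU memory")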
IBM Storage Scale software runs on the Scale System 6000 hardware and provides a POSIX-style file system, optimized for multi-threaded read and write operations across multiple nodes, that acts as an intermediate caching layer between the GPUs and object storage. This active file management (AFM) capability is designed to allow data to be loaded into the GPUs faster whenever a training job is started or restarted, which can be a significant advantage when running AI training workloads.
If a model training process were to be interrupted by a power outage or other error, the entire training run would usually need to be started from scratch. To safeguard against this, the training process stops from time to time to save a checkpoint – a snapshot of the model's entire internal state, including weights, learning rates and other variables – that allows training to be resumed from its last stored state rather than from the beginning.
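In practice, a checkpoint is little more than a dictionary of tensors written to the file system. A minimal sketch using PyTorch (the model, optimizer, step counter, and path are hypothetical placeholders) shows both sides of the operation:

import torch

def save_checkpoint(model, optimizer, step, path="/mnt/scale/checkpoint.pt"):
    # Capture everything needed to resume: the model weights, the optimizer
    # state (learning rates, momentum buffers, etc.) and the current step.
    torch.save(
        {
            "step": step,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(model, optimizer, path="/mnt/scale/checkpoint.pt"):
    # Restore the saved state and return the step at which to resume,
    # so training continues from the last checkpoint rather than step 0.
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state"])
    optimizer.load_state_dict(checkpoint["optimizer_state"])
    return checkpoint["step"]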
However, checkpoint storage requirements increase in step with model size, and some large language models are trained on trillions of tokens. The active file management capabilities in Storage Scale are critical here, enabling training workloads to resume more quickly from the latest checkpoint. For a multi-day or multi-week training run, that can have a major impact.
As organizations build AI-based applications to deliver new kinds of business capabilities, foundation models are likely to continue increasing in complexity and size. That's why GPU clusters need to be paired with data storage systems that won't let I/O bottlenecks impede that progress.
Contributed by IBM.