DeepMind working on distributed training of large AI models
Alternate process could be a game changer if they can make it practicable
Is distributed training the future of AI? As the shock of the DeepSeek release fades, its legacy may be an awareness that alternative approaches to model training are worth exploring, and DeepMind researchers say they've come up with a way of making distributed training much more efficient.
DeepSeek caused an element of panic in the US tech industry, as its AI appeared to perform as well as those from OpenAI and Meta, while the company claimed to have trained its models at a much lower cost (a claim hotly disputed by many) using fewer Nvidia GPUs.
While many doubt the veracity of these claims, the model's release caused the tech industry to take a step back and reconsider the strategy of spending tens of billions of dollars on ever larger models trained using ever larger clusters of AI servers stuffed with eye-wateringly expensive GPUs, all contained in ever larger energy-guzzling datacenters.
Google's DeepMind subsidiary has since published research discussing how to distribute training of models with billions of parameters among clusters of computers that could in theory be widely separated, while producing the same level of quality as before.
In a paper available online titled "Streaming DiLoCo with overlapping communication," the DeepMind researchers build on the company's existing DiLoCo (Distributed Low-Communication Training) approach, but with several modifications to make training on "islands of devices that are poorly connected" a more viable prospect.
The issue, as the paper outlines, is that large language models (LLMs) may require tens of thousands of GPU accelerators to train them, and this number keeps increasing as the models become more complex.
Building and maintaining a datacenter that can cram in that many accelerators is expensive and leads to increasingly complex engineering challenges, the researchers state, not least of which are the networking interconnects and cooling required.
Work on this aspect is also in progress elsewhere, with The Register recently reporting on how industry giants such as Nvidia are looking into connecting separate datacenters together to form a bigger virtual datacenter that allows AI models to grow ever larger.
Beyond the physical infrastructure, "orchestrating the passage of gradients, parameters and intermediate states between these devices at each optimization step, while keeping all devices fully utilized is technically challenging from a software engineering perspective," DeepMind explains.
Data synchronization and consistency are critical in distributed LLM training, but when you are talking about large models, network bandwidth and latency can significantly impact performance.
One way to tackle this is to boost network performance, as Nvidia has notably focused on. The company recently talked up the capabilities of its Spectrum-X technology.
DeepMind's approach with DiLoCo has been to relax the need for co-location of training clusters by creating distributed groups of "workers," where synchronization between workers happens infrequently. This is intended to enable connection by lower bandwidth communication links without affecting learning quality.
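The two-level loop this describes can be sketched in a few lines. The toy quadratic loss, step counts, and plain SGD below are illustrative assumptions, not DeepMind's actual model or hyperparameters; the point is only the communication pattern, where workers run many local steps and exchange a single averaged parameter delta per round.

```python
import numpy as np

def local_grad(theta, target):
    # Gradient of 0.5 * ||theta - target||^2, standing in for one worker's loss
    return theta - target

def diloco_round(theta_global, targets, inner_steps=50, inner_lr=0.1, outer_lr=1.0):
    outer_grads = []
    for target in targets:                        # each "island" of devices
        theta = theta_global.copy()
        for _ in range(inner_steps):              # many cheap local steps, no network
            theta -= inner_lr * local_grad(theta, target)
        outer_grads.append(theta_global - theta)  # "outer gradient" = parameter delta
    # The only communication: one averaged delta per round, not one per step
    avg_delta = np.mean(outer_grads, axis=0)
    return theta_global - outer_lr * avg_delta

theta = np.zeros(4)
targets = [np.full(4, 1.0), np.full(4, 3.0)]      # two workers, different data
for _ in range(10):
    theta = diloco_round(theta, targets)
# theta converges toward the consensus solution (2.0 in each coordinate)
```

With 50 inner steps per round, the workers touch the network ten times in total here, versus 500 gradient exchanges for step-by-step data parallelism on the same schedule.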
Streaming DiLoCo adds three proposed modifications to further tweak its performance: synchronizing subsets of parameters on a schedule, rather than all of the parameters at one go; overlapping worker compute time with the communication of synchronizations; and, lastly, adjusting quantization on the outer gradients to four bits per parameter. The last modification, it claims, can reduce the amount of data needing to be exchanged without loss of performance.
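Two of those modifications, the fragment-by-fragment schedule and the four-bit outer gradients, can be illustrated concretely. The round-robin schedule and the uniform quantizer below are assumptions for demonstration, not the paper's exact scheme:

```python
import numpy as np

def quantize_4bit(x):
    # Uniform 4-bit quantization: 16 levels spanning the tensor's range
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 15 if hi > lo else 1.0
    q = np.round((x - lo) / scale).astype(np.uint8)   # values in 0..15
    return q, lo, scale

def dequantize_4bit(q, lo, scale):
    return q.astype(np.float64) * scale + lo

params = np.random.randn(8)
fragments = np.array_split(np.arange(8), 4)           # 4 fragments of 2 params each

for step in range(8):
    frag = fragments[step % len(fragments)]           # round-robin: one subset per sync
    outer_grad = np.random.randn(len(frag)) * 0.01    # stand-in for a real delta
    q, lo, scale = quantize_4bit(outer_grad)          # 4 bits/param on the wire
    params[frag] -= dequantize_4bit(q, lo, scale)     # apply after dequantizing
```

Each synchronization now moves only a quarter of the parameters, at an eighth of fp32 width, which is where the headline bandwidth savings come from.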
According to the researchers, the paper demonstrates that the new approach is capable of achieving training performance comparable to that of a classical data-parallel method, while using 400 times less bandwidth.
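A rough back-of-the-envelope calculation shows how a saving of that order can arise. The 50-step sync interval below is an assumed figure for illustration; the paper's own 400x number comes from its specific configuration, not this exact arithmetic:

```python
# Data parallelism: a full-precision gradient per parameter, every step.
# Streaming DiLoCo (assumed settings): a 4-bit delta, once per sync interval.
bits_per_step_dp = 32      # fp32 gradient per parameter, synced every step
sync_interval = 50         # assumed inner steps between synchronizations
bits_per_sync_sd = 4       # 4-bit quantized outer gradient per parameter

ratio = bits_per_step_dp * sync_interval / bits_per_sync_sd
print(ratio)  # 400.0
```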
Jack Clark, co-founder of Anthropic and a former Reg reporter, notes that DiLoCo is worth paying attention to.
"Prime Intellect's 'INTELLECT-1' 10 billion parameter model was trained in a distributed way using OpenDiLoCo, an open source variant of DeepMind's DiLoCo approach," Clark says in his Import AI newsletter.
Streaming DiLoCo works well, allowing for that dramatic reduction in bandwidth requirements while exhibiting a negligible impact on model quality, he adds.
"In training simulations at the 1B, 10B, and 100B parameter model scale, they show that streaming DiLoCo is consistently more efficient than vanilla DiLoCo with the benefits growing as you scale up the model," Clark says.
His vision for where the technology might lead is one where countless models are being trained continuously, "each having its roots in a thousand or more distinct computers separated by sometimes great distances," thereby democratizing AI development by taking it out of the hands of the mega corporations with the resources for ever larger compute farms.
Gartner VP Analyst Chirag Dekate is a little more down to earth, and simply notes the progress that distributed training is making.
"Techniques like quantization (mixed precision arithmetic) and overlapping (compute and communication to hide effects of latency) are finely designed engineering attributes designed to overcome limitations of underlying accelerators. Most accelerators today are bottlenecked at memory, memory bandwidth and IO bandwidth layers," Dekate observes.
"Using techniques like the ones used by DeepSeek and Google DeepMind are now becoming the norm. The net effect of this is improved scalability, while utilizing underlying AI supercomputing resources more efficiently. So both models and the AI supercomputers can deliver greater scalability, and together, they can deliver even more powerful AI," he states.
But Streaming DiLoCo is regarded by DeepMind's researchers as merely a first step towards "a distributed free lunch," and there is still a need for further development and testing.
"There are huge opportunities for bringing the ideas from the federated learning literature to the new world of large scale training for LLMs," the paper states, although it adds that "a critical next work is to study how new distributed methods like ours should be tuned and scaled across multiple axes."
In particular, vital work needs to be done to pin down how to efficiently scale the number of DiLoCo replicas for an equivalent token budget, it concludes. ®