For the average AI shop, sparse models and cheap memory will win
Massive language models aren't for everyone, but neither is heavy-duty hardware, says AI systems maker Graphcore
As compelling as the leading large-scale language models may be, the fact remains that only the largest companies have the resources to actually deploy and train them at meaningful scale.
For enterprises eager to leverage AI to a competitive advantage, a cheaper, pared-down alternative may be a better fit, especially if it can be tuned to particular industries or domains.
That’s where an emerging set of AI startups hopes to carve out a niche: by building sparse, tailored models that, while maybe not as powerful as GPT-3, are good enough for enterprise use cases and run on hardware that ditches expensive high-bandwidth memory (HBM) for commodity DDR.
German AI startup Aleph Alpha is one such example. Founded in 2019, the Heidelberg-based company offers Luminous, a natural-language model that boasts many of the same headline-grabbing features as OpenAI’s GPT-3: copywriting, classification, summarization, and translation, to name a few.
The model startup has teamed up with Graphcore to explore and develop sparse language models on the British chipmaker's hardware.
“Graphcore’s IPUs present an opportunity to evaluate the advanced technological approaches such as conditional sparsity,” Aleph Alpha CEO Jonas Andrulis said in a statement. “These architectures will undoubtedly play a role in Aleph Alpha’s future research.”
Graphcore’s big bet on sparsity
Conditionally sparse models — sometimes called mixture of experts or routed models — only process data against the applicable parameters, something that can significantly reduce the compute resources needed to run them.
For example, if a language model were trained on all the languages on the internet and then asked a question in Russian, it wouldn’t make sense to run that data through the entire model rather than only through the parameters related to the Russian language, explained Graphcore CTO Simon Knowles in an interview with The Register.
“It’s completely obvious. This is how your brain works, and it’s also how an AI ought to work,” he said. “I’ve said this many times, but if an AI can do many things, it doesn’t need to access all of its knowledge to do one thing.”
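To make the routing idea concrete, here is a minimal sketch of a top-k routed layer in plain Python. It is an illustration only, not Graphcore's or Aleph Alpha's implementation: the class name `ToyRoutedLayer`, the random weights, and the choice of eight experts with two active per input are all assumptions made for the example.

```python
import math
import random


def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ToyRoutedLayer:
    """Hypothetical routed layer: each 'expert' is just one weight vector."""

    def __init__(self, num_experts, dim, top_k=1, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # The router scores every expert; only the top_k experts are evaluated.
        self.router = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
        self.experts = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]

    def forward(self, x):
        gate = softmax([dot(w, x) for w in self.router])
        ranked = sorted(range(len(gate)), key=lambda i: gate[i], reverse=True)
        chosen = ranked[: self.top_k]
        norm = sum(gate[i] for i in chosen)
        # Only the chosen experts' parameters are touched for this input.
        output = sum(gate[i] / norm * dot(self.experts[i], x) for i in chosen)
        return output, chosen


layer = ToyRoutedLayer(num_experts=8, dim=4, top_k=2)
output, used = layer.forward([0.5, -1.0, 0.25, 2.0])
active_fraction = len(used) / len(layer.experts)
print(f"experts used: {used}, share of expert parameters touched: {active_fraction:.0%}")
```

In this toy setup only two of the eight experts run for any given input, so three quarters of the expert parameters are never read. That is the saving Knowles is describing, though real routed models differ in many details.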
Knowles, whose company builds accelerators tailored for these kinds of models, unsurprisingly believes they’re the future of AI. “I’d be surprised if, by next year, anyone is building dense language models,” he added.
HBM-2 pricey? Cache in on DDR instead
Sparse language models aren’t without their challenges. One of the most pressing, according to Knowles, has to do with memory. The HBM used in high-end GPUs to achieve the bandwidth and capacity these models require is expensive, and it is attached to an even more expensive accelerator.
This isn’t an issue for dense language models, where you might need all of that compute and memory, but it poses a problem for sparse models, which favor memory over compute, he explained.
Interconnect tech, like Nvidia’s NVLink, can be used to pool memory across multiple GPUs, but if the model doesn’t require all that compute, the GPUs could be left sitting idle. “It’s a really expensive way to buy memory,” Knowles said.
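A back-of-the-envelope sketch shows why pooling HBM can leave compute idle. Every number here is a made-up placeholder rather than a figure from Graphcore or Nvidia, and the helper `idle_fraction` is hypothetical: it assumes you must buy enough accelerators to hold the whole model in HBM, even if the sparse model only keeps a handful of them busy.

```python
import math


def idle_fraction(model_gb, hbm_gb_per_device, devices_needed_for_compute):
    """Return (devices bought, share of them idle) when memory capacity drives the purchase."""
    devices_for_memory = math.ceil(model_gb / hbm_gb_per_device)
    devices_bought = max(devices_for_memory, devices_needed_for_compute)
    return devices_bought, 1 - devices_needed_for_compute / devices_bought


# Illustrative placeholders: a 2,000 GB model, 80 GB of HBM per accelerator,
# and a sparse workload whose compute would fit on 5 accelerators.
bought, idle = idle_fraction(model_gb=2000, hbm_gb_per_device=80, devices_needed_for_compute=5)
print(f"accelerators bought for capacity: {bought}, idle share: {idle:.0%}")
```

Under those assumed numbers, 25 accelerators are bought for their memory while the compute of only five is used.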
Graphcore’s accelerators attempt to sidestep this challenge by borrowing a technique as old as computing itself: caching. Each IPU features a relatively large SRAM cache — 1GB — to satiate the bandwidth requirements of these models, while raw capacity is achieved using large pools of inexpensive DDR4 memory.
“The more SRAM you've got, the less DRAM bandwidth you need, and this is what allows us to not use HBM,” Knowles said.
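The relationship Knowles describes can be illustrated with a small cache simulation. This is a sketch under stated assumptions, not a model of the IPU: it uses a least-recently-used (LRU) cache, a synthetic access trace skewed toward a few popular blocks, and arbitrary cache sizes. The function `dram_traffic_fraction` is a name invented for this example.

```python
import random
from collections import OrderedDict


def dram_traffic_fraction(trace, cache_blocks):
    """Share of accesses that miss an LRU cache and so must be served from DRAM."""
    cache = OrderedDict()
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1                    # miss: fetch the block from DRAM
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return misses / len(trace)


# Assumed workload: 1,000 parameter blocks, with popularity falling off as 1/rank.
rng = random.Random(0)
NUM_BLOCKS = 1000
weights = [1 / (rank + 1) for rank in range(NUM_BLOCKS)]
trace = rng.choices(range(NUM_BLOCKS), weights=weights, k=20000)

fractions = {size: dram_traffic_fraction(trace, size) for size in (10, 100, 500, 1000)}
for size, fraction in fractions.items():
    print(f"cache of {size:>4} blocks -> {fraction:.1%} of accesses go to DRAM")
```

For a fixed trace, an LRU cache never misses more often as it grows, so the DRAM share can only fall or stay flat as the on-chip cache gets larger. How steeply it falls depends entirely on the access pattern, which is assumed here.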
Decoupling memory from the accelerator makes it far less expensive for enterprises to support larger AI models: the added cost is that of a few commodity DDR modules.
In addition to supporting cheaper memory, Knowles claims the company’s IPUs also have an architectural advantage over GPUs, at least when it comes to sparse models.
Instead of running on a small number of large matrix multipliers — like you find in a tensor processing unit — Graphcore’s chips feature a large number of smaller matrix math units that can address the memory independently.
This provides greater granularity for sparse models, where “you need the freedom to fetch relevant subsets, and the smaller the unit you’re obliged to fetch, the more freedom you have,” he explained.
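The granularity point can be shown with simple arithmetic. The sketch below assumes memory can only be read in aligned units of a fixed size, and that a sparse model needs a few scattered parameters. The indices and unit sizes are invented for illustration, and `elements_fetched` is a hypothetical helper.

```python
def elements_fetched(needed_indices, fetch_unit):
    """Elements moved when memory is read only in aligned units of `fetch_unit` elements."""
    units_touched = {index // fetch_unit for index in needed_indices}
    return len(units_touched) * fetch_unit


# Assumed sparse access: five parameters scattered across a large array.
needed = [3, 130, 131, 900, 4000]

results = {unit: elements_fetched(needed, unit) for unit in (1, 16, 256)}
for unit, fetched in results.items():
    useful = len(needed) / fetched
    print(f"fetch unit {unit:>3}: {fetched:>3} elements moved, {useful:.1%} of them useful")
```

With a fetch unit of one element, exactly the five needed values are moved. At 16 elements per fetch, 64 are moved, and at 256 per fetch, 768 are moved to obtain the same five values.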
The verdict is still out
Put together, Knowles argues this approach enables Graphcore’s IPUs to train large AI/ML models with hundreds of billions or even trillions of parameters at substantially lower cost than GPUs.
However, the enterprise AI market is still in its infancy, and Graphcore faces stiff competition in this space from larger, more established rivals.
So while development of ultra-sparse, cut-rate language models is unlikely to abate anytime soon, it remains to be seen whether it’ll be Graphcore’s IPUs or someone else’s accelerator that ends up powering enterprise AI workloads. ®