SambaNova makes Llama gallop in inference cloud debut
AI infra startup serves up Llama 3.1 405B at 100+ tokens per second
Not to be outdone by rival AI systems upstarts, SambaNova has launched an inference cloud of its own that it says is ready to serve up Meta's largest models faster than the rest.
The cloud offering is one of several which have cropped up amid the AI boom, offering API access to popular open-weight models. Most of these are GPU-based, but for the more boutique vendors dealing in specialized hardware, like Cerebras, Groq, and now SambaNova, it seems whoever can get the largest model to spit out tokens the fastest has a leg up.
If you're not familiar, tokens here refer to how large language models encode words, word fragments, punctuation, and figures. So, the faster your infrastructure can generate tokens, the less time you're left waiting for a response.
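For a sense of what that looks like in practice, here's a quick sketch using OpenAI's open source tiktoken tokenizer. Llama 3.1 ships its own BPE vocabulary, so the exact splits will differ, but the idea is the same:

```python
# Illustration only: Llama 3.1 uses its own ~128K-entry BPE vocabulary,
# but any BPE tokenizer (here, OpenAI's tiktoken) shows the same idea.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "SambaNova serves Llama 3.1 405B at 132 tokens per second."
tokens = enc.encode(text)

print(len(text.split()), "words ->", len(tokens), "tokens")
print([enc.decode([t]) for t in tokens])  # words, fragments, punctuation, digits
```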
According to CEO Rodrigo Liang, SambaNova has managed to get Meta's 405 billion parameter Llama 3.1 model (more than twice the size of OpenAI's GPT-3.5) to churn out tokens at a rate of 132 per second, and at the full 16-bit precision it was trained at, no less.
To put that in perspective, it's estimated the average person can read at about 5 words per second. At 132 tokens a second, SambaNova's system is nearly twice as fast as the next fastest GPU-based systems, at least according to Artificial Analysis data cited in SambaNova's announcement.
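The comparison rests on the rough rule of thumb that an English word works out to around 1.3 tokens. The arithmetic below is our back-of-envelope math, not SambaNova's, but it shows why triple-digit token rates feel instantaneous to a reader:

```python
# Back-of-envelope comparison, assuming the common rule of thumb
# of roughly 0.75 English words per token.
WORDS_PER_TOKEN = 0.75
READING_SPEED_WPS = 5  # words per second, per the figure cited above

for tokens_per_sec in (132, 72):  # SambaNova's claim vs the fastest GPU result
    words_per_sec = tokens_per_sec * WORDS_PER_TOKEN
    print(f"{tokens_per_sec} tok/s ~ {words_per_sec:.0f} words/s "
          f"({words_per_sec / READING_SPEED_WPS:.0f}x reading speed)")
```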
SambaNova's SN40L-based systems come in nearly twice as fast as competing platforms, according to data from Artificial Analysis
Pedal to the metal
Introduced earlier this summer, Llama 3.1 405B is Meta's first frontier-class model capable of going toe-to-toe with much larger models from the likes of OpenAI, Anthropic, and Google.
And while 405B is far smaller than competing models, running it at 16-bit precision isn't an easy feat, as simply fitting the weights into memory requires 810 GB of capacity. That's not even counting the space required by the key-value cache.
To run the model, SambaNova used 16 of its SN40L accelerators, each with 64 GB of speedy HBM3 memory and 520 MB of on-die SRAM. You can find a full breakdown of the chip, codenamed Cerulean 1, on our sibling site The Next Platform.
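As a rough sketch of the capacity math, counting only the HBM figures above and ignoring the on-die SRAM and the key-value cache:

```python
# Rough capacity math for Llama 3.1 405B at 16-bit precision (weights only).
PARAMS = 405e9        # parameter count
BYTES_PER_PARAM = 2   # 16-bit (bf16/fp16) weights

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"Weights alone: {weights_gb:.0f} GB")            # ~810 GB

ACCELERATORS = 16
HBM_PER_CHIP_GB = 64
total_hbm_gb = ACCELERATORS * HBM_PER_CHIP_GB
print(f"HBM across 16 SN40L chips: {total_hbm_gb} GB")  # 1,024 GB
print(f"Left over for KV cache etc: {total_hbm_gb - weights_gb:.0f} GB")
```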
Using this configuration, SambaNova boasts it has achieved a throughput of 132 tokens per second on 405B, and 461 tokens a second when running the smaller 70 billion parameter variant. By comparison, data from Artificial Analysis shows that even the best GPU-based systems only manage to serve Meta's 405B model at 72 tokens per second, with most running much slower than that.
What's more, the startup claims it's able to maintain performance in excess of 100 tokens per second up to a batch size of four. Or, in other words, for up to four simultaneous requests. According to Anton McGonnell, head of SambaNova's software products division, there may be some additional headroom to scale that even further.
This level of performance is possible in part thanks to the SN40L's larger caches, McGonnell told The Register. This, he added, allows it to avoid the performance overheads commonly seen in multi-GPU systems.
"If GPUs could truly utilize their memory bandwidth, they will be much faster, but they can't," he explained.
But, while SambaNova was able to get Llama 3.1 405B running at 16-bit precision, it wasn't without compromise. One of the biggest concessions is that the model isn't running at its full 128k token context window, which has instead been cut back to 8k.
"For the purposes of launch, we're just making the 8k version available, if only because of traffic," McGonnell said. "If people start using 128k, then it slows everything down for everybody else."
While this is unlikely to be a problem for something like a customer service chatbot, it will limit the service's practicality for longer-context applications like document summarization.
The competition heats up
SambaNova Cloud's free and paid enterprise tiers are available starting today. The infrastructure provider also plans to roll out a developer tier later this year which, in addition to higher rate limits, will let devs build models based on Llama 3.1.
However, as we mentioned earlier, SambaNova is far from the only infrastructure vendor leaning on speed to differentiate itself from a sea of GPU-based offerings. Cerebras, which announced its own inference cloud at the Hot Chips conference late last month, already boasts performance of up to 450 tokens per second in Llama 3.1 70B and anticipates it will be able to achieve 350 tokens per second when running the 405B variant. If Cerebras can actually pull that off, it'll put the company well ahead of SambaNova, even if doing so will require 12 of its wafer-scale chips.
There's also Groq, which has previously managed to achieve throughputs of 300 tokens a second in Llama 2 70B using some 576 of its language processing units. The firm recently nabbed $640 million in a series-D funding round, which among other things will help it ramp up the development of its next-gen accelerators. ®