DeepSeek's not the only Chinese LLM maker OpenAI and pals have to worry about. Right, Alibaba?
Qwen 2.5 Max tops both DS V3 and GPT-4o, cloud giant claims
Analysis The speed and efficiency with which DeepSeek claims to be training large language models (LLMs) competitive with America's best has been a reality check for Silicon Valley. However, the startup isn't the only Chinese model builder the US has to worry about.
This week Chinese cloud and e-commerce goliath Alibaba unveiled a flurry of LLMs including what appears to be a new frontier model called Qwen 2.5 Max that it reckons not only outperforms DeepSeek's V3, which the reasoning-capable R1 is based on, but trounces America's top models.
As always, we recommend taking benchmarks with a grain of salt, but if Alibaba is to be believed, Qwen 2.5 Max – which can search the web and output text, video, and images from inputs – managed to outperform OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Meta's Llama 3.1 405B across the popular Arena-Hard, MMLU-Pro, GPQA-Diamond, LiveCodeBench, and LiveBench benchmark suites.
Given the fervor around DeepSeek, we feel compelled to emphasize that Alibaba is drawing comparisons against V3 and not the R1 model that has the world abuzz. This might also explain the comparison to GPT-4o rather than OpenAI's flagship o1 models.
In any case, the announcement further fuels the perception that, despite the West's ongoing efforts to stifle Chinese AI development, the US lead in AI may not be as large as previously thought. It also reinforces the perception that the countless billions of dollars demanded by Silicon Valley to develop artificial intelligence look a little greedy.
Speeds and feeds, or lack thereof
Unfortunately, beyond performance claims, API access, and a web-based chatbot, Alibaba's Qwen team is being rather tight-lipped about its latest model release. Unlike DeepSeek, whose models are freely available to download and use if you don't want to rely on DeepSeek's apps or cloud, Alibaba has not released Qwen 2.5 Max for download; it's only accessible from Alibaba's servers.
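For the curious, here's roughly what that API access looks like in practice – a minimal sketch using the OpenAI-compatible endpoint Alibaba Cloud exposes for its Qwen models. The base URL, model identifier, and environment variable below are our own illustrative assumptions and worth double-checking against Alibaba's documentation before use.

```python
# Minimal sketch of calling Qwen 2.5 Max through Alibaba Cloud's
# OpenAI-compatible API. The base URL, model name, and env var are
# illustrative assumptions - check Alibaba's Model Studio docs for
# the current values and bring your own API key.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],  # assumed name for your Alibaba Cloud key
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

response = client.chat.completions.create(
    model="qwen-max-2025-01-25",  # assumed identifier for Qwen 2.5 Max
    messages=[{"role": "user", "content": "How many parameters do you have?"}],
)
print(response.choices[0].message.content)
```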
What we do know so far is that Qwen 2.5 Max is a large-scale mixture-of-experts (MoE) model that was trained on a corpus of 20 trillion tokens before being further refined using supervised fine-tuning and reinforcement learning from human feedback.
As the name suggests, MoE models like Mistral AI's Mixtral series and DeepSeek's V3 and R1 comprise several artificial experts, if you will, that have been trained to handle specific tasks, such as coding or math.
MoE models have become increasingly popular among model builders as a way to decouple parameter count from the compute required to run them. Because only a portion of the model is active for any given request – there's no need to activate the entire neural network to tackle a query, just the "expert" parts relevant to it – it's possible to grow the parameter count without compromising throughput.
That is to say, rather than running an input through the entire multi-billion-parameter network and performing all those calculations for every token, only the relevant experts are used, meaning outputs are generated faster and more cheaply.
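To make that concrete, here's a toy sketch of top-k expert routing in the spirit of an MoE layer – the dimensions, the number of experts, and the choice of two active experts per token are purely illustrative and say nothing about Qwen 2.5 Max's actual architecture.

```python
# Toy mixture-of-experts routing: a router scores every expert, but only the
# top-k experts are actually run for each token. All sizes here are made up
# for illustration - they are not Qwen 2.5 Max's real configuration.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, k = 64, 8, 2                           # assumed toy dimensions
router = rng.normal(size=(d_model, n_experts))             # router weights
experts = rng.normal(size=(n_experts, d_model, d_model))   # one weight matrix per expert

def moe_layer(token: np.ndarray) -> np.ndarray:
    scores = token @ router                       # one score per expert
    top = np.argsort(scores)[-k:]                 # indices of the k best-scoring experts
    gate = np.exp(scores[top])
    gate /= gate.sum()                            # softmax over the chosen experts only
    # Only the selected experts do any work; the other n_experts - k sit idle,
    # which is why a huge total parameter count needn't slow down inference.
    return sum(w * (token @ experts[i]) for w, i in zip(gate, top))

print(moe_layer(rng.normal(size=d_model)).shape)  # (64,)
```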
At this point, Alibaba hasn't disclosed just how big Qwen 2.5 Max is. However, we do know the previous Qwen Max model was around 100 billion parameters in size.
The Register reached out to Alibaba for comment; we'll let you know if we hear back. In the meantime we asked Qwen 2.5 Max, via its online chatbot form, to share its specs, and it doesn't appear to know much about itself either. But even if it did spit out a number, we're not sure we'd believe it.
Performance at what cost
Unlike many previous Qwen models, we may never get hold of Qwen 2.5 Max's neural network weights. On the Alibaba Cloud website, the model is listed as being proprietary, which might explain why the Chinese super-corp is sharing so little about the model.
Not disclosing parameter counts and other key details is par for the course for many model builders, and Alibaba has been similarly tight-lipped with regard to its proprietary Qwen Turbo and Qwen Plus models.
The lack of detail makes evaluating model performance somewhat challenging, as performance has to be weighed against cost. A model may outperform another in benchmarks, but if it costs 3-4x more to run, it may not be worth the hassle. This certainly appears to be the case with Qwen 2.5 Max.
For the moment, Alibaba's website has API access to the model listed at $10 per million input tokens and $30 for every million tokens generated. Compare that to GPT-4o, for which OpenAI is charging $2.50 per million input tokens and $10 per million output tokens, or half that if you opt for its batch processing.
With that said, Qwen 2.5 Max is still cheaper than OpenAI's flagship o1 model, which will run you $15 per million input tokens and $60 per million tokens generated.
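To put those list prices into perspective, here's a quick back-of-the-envelope comparison for a hypothetical workload of one million input tokens and 250,000 output tokens – the workload size is arbitrary, and prices can of course change.

```python
# Back-of-the-envelope API cost comparison using the list prices quoted above,
# in USD per million tokens. The workload size is an arbitrary assumption.
PRICES = {                    # (input $/1M tokens, output $/1M tokens)
    "Qwen 2.5 Max": (10.00, 30.00),
    "GPT-4o":       (2.50, 10.00),
    "OpenAI o1":    (15.00, 60.00),
}

input_tokens, output_tokens = 1_000_000, 250_000  # hypothetical workload

for model, (in_price, out_price) in PRICES.items():
    cost = input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price
    print(f"{model}: ${cost:.2f}")
# Prints $17.50 for Qwen 2.5 Max, $5.00 for GPT-4o, and $30.00 for o1
```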
A growing family
As mentioned, Qwen 2.5 Max is just the latest in a string of LLMs released by the Chinese mega-biz since 2023. Its most recent generation of models, which bear the Qwen 2.5 name, began trickling out in September, with Alibaba openly releasing weights for its 0.5, 1.5, 3, 7, 14, 32, and 72-billion-parameter versions.
Pitted against its contemporaries, Alibaba claimed the largest of these models could go toe-to-toe with, and in some cases best, Meta's far larger 405B Llama model. But again, we recommend taking these claims with a grain of salt.
Alongside its general-purpose models, Alibaba also released the weights for several math and code-optimized LLMs and extended access to a pair of proprietary models called Qwen Plus and Qwen Turbo, which boasted alleged performance within spitting distance of GPT-4o and GPT-4o mini.
In December, it detailed its OpenAI o1-style "thinking" model called QwQ. And then this week, leading up to the Qwen 2.5 Max launch, the cloud provider announced a trio of open vision language models (VLMs) weighing in at 3, 7, and 72 billion parameters. Alibaba contends the largest of these is competitive with the likes of Google's Gemini 2, OpenAI's GPT-4o, and Anthropic's Claude 3.5 Sonnet, at least in vision benchmarks.
If that weren't enough, this week also saw Alibaba roll out upgraded versions of its 7 and 14-billion-parameter Qwen 2.5 models, which boost their context windows – essentially their short-term memory – to a million tokens.
Longer context windows can be particularly useful for retrieval-augmented generation, aka RAG, enabling models to parse larger quantities of information from documents without getting lost.
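As a rough illustration of why that matters, a million-token window means many document piles can simply be dropped into the prompt wholesale rather than whittled down to a handful of retrieved chunks first. The sketch below uses a crude four-characters-per-token heuristic rather than a real tokenizer, and the reserved answer budget is an arbitrary assumption.

```python
# Crude sketch: decide whether a set of documents fits in a model's context
# window, or whether it needs to be trimmed down via retrieval first.
# The ~4 chars/token heuristic is a rough assumption, not a real tokenizer.
def rough_token_count(text: str) -> int:
    return len(text) // 4

def fits_in_context(documents: list[str], context_window: int = 1_000_000,
                    reserved_for_answer: int = 8_000) -> bool:
    total = sum(rough_token_count(doc) for doc in documents)
    return total <= context_window - reserved_for_answer

docs = ["lorem ipsum " * 5_000] * 10   # stand-in documents
if fits_in_context(docs):
    print("Stuff everything into the prompt")        # long-context path
else:
    print("Retrieve only the most relevant chunks")  # classic RAG path
```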
Questions and concerns remain
But for all the hype Chinese model builders have enjoyed and the market volatility they've caused over the past week, questions and concerns over censorship and privacy persist.
As we pointed out with DeepSeek, user data collected by its online services will be stored in China, per its privacy policy. It's a similar story with Alibaba's Qwen Chat, which may store data in either its Singapore or Chinese datacenters.
This might not be a major concern for some, but for others it poses a legitimate risk. Posting on X earlier this week, OpenAI API dev Steve Heidel quipped, "Americans sure love giving their data away to the CCP in exchange for free stuff."
Concerns have also been raised about censorship of controversial topics that may paint the Beijing regime in an unfavorable light. Just as we've seen with previous Chinese models, both DeepSeek and Alibaba will leave out information on sensitive topics, stop generation prematurely, or outright refuse to answer questions regarding topics like the Tiananmen Square massacre or the political status of Taiwan. ®