Mistral Large 2 leaps out as a leaner, meaner rival to GPT-4-class AI models
It's not the size that matters, it's how you use it
Mistral AI on Wednesday revealed a 123-billion-parameter large language model (LLM) called Mistral Large 2 (ML2) which, it claims, comes within spitting distance of the top models from OpenAI, Anthropic, and Meta.
The news comes a day after Meta launched the hotly anticipated 405-billion-parameter variant of Llama 3 with a 128,000 token context window – think of this as the model's short-term memory – and support for eight languages.
ML2 boasts many of these same qualities – including the 128,000 token context window, support for "dozens" of languages, and more than 80 coding languages. Language support has been one of Mistral's biggest differentiators compared to other open models – which are often English-only – and ML2 continues this trend.
If Mistral's benchmarks are to be believed, ML2 trades blows with OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, Meta's Llama 3.1 405B, and others across a number of language, coding, and mathematics tests.
For instance, in the popular Massive Multitask Language Understanding (MMLU) benchmark, the French model builder's latest LLM achieves a score of 84 percent. By comparison, just yesterday Meta revealed Llama 3.1 405B achieved a score of 88.6 percent while GPT-4o and Claude 3.5 Sonnet manage scores of 88.7 and 88.3 percent, respectively. Scientists estimate that domain experts – the human kind – would score in the neighborhood of 89.8 percent on the benchmark.
Large but not too large
While that's impressive in its own right, the more important factor is that ML2 manages to achieve this level of performance using a fraction of the resources of competing models. ML2 is less than a third the size of Meta's biggest model and roughly one fourteenth the size GPT-4 is rumored to be.
This has major implications for deployment, and will no doubt make ML2 a very attractive model for commercial applications. At the full 16-bit precision at which it was trained, the 123-billion-parameter model requires about 246GB of memory. For now, that's still too large to fit on a single GPU or accelerator from Nvidia, AMD, or Intel – but it could easily be deployed on a single server with four or eight GPUs without resorting to quantization.
The same can't necessarily be said of GPT-4, presumably Claude 3.5 Sonnet, or Meta's Llama 3.1 405B. In fact, as we discussed earlier this week, Meta opted to provide an 8-bit quantized version of the 3.1 model so it could run on existing HGX A100 and H100 systems. You can learn more about quantization in our hands-on guide here – in a nutshell, it's a compression method that trades model precision for memory and bandwidth savings.
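For those who want to check the memory math, here is a rough Python sketch. It counts weights only and ignores the KV cache, activations, and runtime overhead, so real-world requirements will be higher. The function names and the 80GB-per-GPU figure are our own assumptions for illustration, not anything published by Mistral or Meta.

```python
# Back-of-the-envelope estimate of the memory needed just to hold model weights.
# Weights only: KV cache, activations, and framework overhead are NOT counted.

def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Weight footprint in GB, taking 1 GB as 10**9 bytes."""
    bytes_per_param = bits_per_param / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bytes_per_param


def fits(weights_gb: float, gpus: int, gb_per_gpu: float) -> bool:
    """True if the weights alone fit in the pooled GPU memory."""
    return weights_gb <= gpus * gb_per_gpu


# Parameter counts (in billions) as quoted in the article.
MODELS = {"Mistral Large 2": 123, "Llama 3.1 405B": 405}

# Assumption: an eight-GPU server with 80 GB of memory per GPU.
GPUS, GB_PER_GPU = 8, 80

for name, size in MODELS.items():
    for bits in (16, 8):
        need = weight_memory_gb(size, bits)
        verdict = "fits" if fits(need, GPUS, GB_PER_GPU) else "does not fit"
        print(f"{name} at {bits}-bit: {need:.0f} GB, {verdict} in {GPUS}x{GB_PER_GPU} GB")
```

Under those assumptions, ML2's 16-bit weights clear the bar with room to spare, while the 405B model only squeezes in once it's quantized to 8 bits.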
But, as Mistral is keen to point out, ML2's smaller footprint also means it can achieve much higher throughput. This is because LLM performance, often measured in tokens per second, is dictated in large part by memory bandwidth. In general, for any given system, smaller models will produce responses to queries faster than larger ones, because they put less pressure on the memory subsystem.
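The bandwidth argument can be sketched in a few lines of Python. This is a simplified ceiling, not a benchmark: it assumes a dense model serving a single request, where every weight has to be read from memory once per generated token, and it ignores compute, KV cache traffic, and batching. The 1,000 GB/s bandwidth figure is a hypothetical round number chosen for illustration.

```python
# Rough upper bound on single-stream decode speed for a dense model.
# Assumption: each generated token requires reading every weight once,
# so tokens/sec cannot exceed memory bandwidth divided by weight size.

def max_tokens_per_second(bandwidth_gb_s: float,
                          params_billions: float,
                          bits_per_param: int) -> float:
    """Bandwidth-bound ceiling on tokens per second (ignores compute limits)."""
    weight_gb = params_billions * bits_per_param / 8
    return bandwidth_gb_s / weight_gb


BANDWIDTH_GB_S = 1000.0  # hypothetical memory bandwidth, for illustration only

ml2 = max_tokens_per_second(BANDWIDTH_GB_S, 123, 16)
llama = max_tokens_per_second(BANDWIDTH_GB_S, 405, 16)

print(f"Mistral Large 2 ceiling: {ml2:.2f} tokens/sec")
print(f"Llama 3.1 405B ceiling:  {llama:.2f} tokens/sec")
print(f"Ratio at equal precision: {ml2 / llama:.2f}x")
```

Whatever bandwidth you plug in, the ratio between the two models at the same precision stays the same, because it depends only on how many bytes each has to move per token.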
If you happen to have a beefy enough system, you can try Mistral Large 2 for yourself by following our guide to running LLMs at home.
Prioritizing accuracy and concision
In its launch announcement, Mistral highlighted its efforts to combat hallucinations – where the model generates convincing but factually inaccurate information.
This included fine-tuning the model to be more "cautious and discerning" about how it responds to requests. Mistral also explained the model was trained to recognize when it doesn't know something, or if it has insufficient information to answer – there's perhaps a lesson in that for all of us. Mistral also contends that ML2 should be much better than past models at following complex instructions, especially in longer conversations.
This is good news, as one of the main ways in which people interact with LLMs is through plain-language prompts that dictate how the model should respond or behave. You can find an example of that in our recent AI containerization guide, in which we coax Microsoft's Phi 3 Mini into acting like a TV weather personality.
Additionally, Mistral claims ML2 has been optimized to generate succinct responses wherever possible. While it notes that long-form responses can result in higher scores in some benchmarks, they aren't always desirable in business contexts – they tend to tie up the compute for longer, resulting in higher operational costs.
Open-ish
While ML2 is open – in the sense it's freely available on popular repositories like Hugging Face – the model's license is more restrictive than those of many of Mistral's past models.
For instance, the recently released Mistral-NeMo-12B model, which was developed in collaboration with Nvidia, bore an open source Apache 2 license.
ML2, on the other hand, bears the far less permissive Mistral Research License, which allows for use in non-commercial and research capacities, but requires a separate commercial license if you want to put it to work in a business setting.
Considering the amount of computational horsepower required to train, fine-tune, and validate larger models, this isn't all that surprising. It also isn't the first time we've seen model builders give away smaller models under common open source licenses only to restrict their larger ones. Alibaba's Qwen2 model, for instance, is licensed under Apache 2 with the exception of the 72B variant, which uses its own Qianwen license. ®