Meta debuts its first 'mixture of experts' models from the Llama 4 herd

Says they’re done right as they don’t lean so far left

Meta has debuted the first two models in its Llama 4 family, the company's first to use mixture-of-experts tech.

A Saturday post from the social media giant announced the release of two models:

  • Llama 4 Scout, which has 109 billion parameters, 17 billion of which are active across its 16 experts at any given time. Meta says it can fit on a single Nvidia H100 GPU — though doing so will require heavy quantization, and even then you won't be able to take advantage of its 10 million token context window (see the back-of-envelope maths after this list);
  • Llama 4 Maverick, which is larger at 402 billion parameters spread across 128 experts, though like Scout only 17 billion are active for any given query.
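
To see why quantization matters, here's some rough back-of-envelope Python. The 80 GB figure is the H100's standard memory capacity; the precision levels are illustrative assumptions rather than anything Meta has published about its deployment recipe.

```python
# Back-of-envelope weight-memory check for Scout on one 80 GB H100.
# Precisions here are illustrative assumptions, not Meta's recipe.
params = 109e9                                   # Scout's total parameter count
for bits, label in [(16, "BF16"), (8, "INT8/FP8"), (4, "INT4")]:
    gb = params * bits / 8 / 1e9                 # bytes per weight times params
    verdict = "fits" if gb < 80 else "does not fit"
    print(f"{label:>8}: {gb:6.1f} GB -> {verdict} in 80 GB")
# BF16 needs 218 GB and INT8 109 GB; only ~4-bit weights (54.5 GB) squeeze in,
# leaving little room for the KV cache a long context window would demand.
```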

Mixture of Experts (MoE) is an approach to machine learning that divides a task into several smaller jobs and assigns each to a neural network subsystem tuned to solving that sort of problem. Each expert solves its own part of a problem, and that work is combined into a single response. DeepSeek-V3 is a MoE model, as is Mistral AI's Mixtral 8x7B. OpenAI has neither confirmed nor denied that it already uses MoE, but has hinted the technique is in its future plans, because the approach is widely felt to produce better output with fewer resources.
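
For the curious, here's a minimal sketch of how top-k MoE routing works, in NumPy with toy dimensions. It is illustrative only: the layer sizes, router design, and top-2 routing are assumptions made for brevity, not Meta's actual architecture.

```python
# A minimal sketch of top-k mixture-of-experts routing in NumPy.
# Dimensions and top-2 routing are toy assumptions, not Meta's design.
import numpy as np

d_model, n_experts, top_k = 64, 16, 2   # 16 experts, as in Scout

rng = np.random.default_rng(0)
router_w = rng.standard_normal((d_model, n_experts))             # learned router
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):
    """Send each token to its top-k experts and blend their outputs."""
    logits = x @ router_w                              # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]      # chosen expert indices
    sel = np.take_along_axis(logits, top, axis=-1)     # their router scores
    gate = np.exp(sel - sel.max(-1, keepdims=True))    # softmax over chosen only
    gate /= gate.sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                        # per token, for clarity
        for k in range(top_k):
            out[t] += gate[t, k] * (x[t] @ experts[top[t, k]])
    return out

tokens = rng.standard_normal((4, d_model))             # four toy token vectors
print(moe_layer(tokens).shape)                         # (4, 64)
```

Only the chosen experts' weights do any work for a given token, which is how a 402-billion-parameter model like Maverick gets away with just 17 billion active parameters per query.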

Scout and Maverick are based on Llama 4 Behemoth, which is still in training. Meta says that model has 288 billion active parameters, 16 experts, and nearly two trillion parameters in total.

Meta's post states that pre-training for Behemoth is being done in FP8 precision on 32,000 GPUs, and that its crew achieved 390 TFLOPs per GPU while doing so. “The overall data mixture for training consisted of more than 30 trillion tokens, which is more than double the Llama 3 pre-training mixture and includes diverse text, image, and video datasets,” the post adds.
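
Taken at face value, those numbers imply serious aggregate grunt. Here's a quick sanity check; note that it assumes an H100-class fleet, which Meta's post doesn't confirm, and the peak figure used is Nvidia's published dense-FP8 spec for the H100 SXM.

```python
# Implied aggregate throughput and utilization from Meta's stated figures.
# Assumes an H100-class fleet; Meta's post doesn't name the hardware.
gpus = 32_000
achieved_tflops = 390            # per GPU, as Meta claims
h100_fp8_dense_peak = 1_979      # TFLOPS, Nvidia's published H100 SXM spec

print(f"aggregate: ~{gpus * achieved_tflops / 1e6:.1f} exaFLOPS")    # ~12.5
print(f"of FP8 peak: ~{achieved_tflops / h100_fp8_dense_peak:.0%}")  # ~20%
```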

Zuck's AI squad also claimed it has developed "a new training technique which we refer to as MetaP that allows us to reliably set critical model hyper-parameters such as per-layer learning rates and initialization scales."

That, or so Meta claims, has made it possible for Llama 4 to enable "open source fine-tuning efforts by pre-training on 200 languages, including over 100 with over 1 billion tokens each, and overall 10x more multilingual tokens than Llama 3."

Meta's post doesn't detail the corpus used to train the Llama 4 models, an important issue given the company is accused of using pirated content to train its models.

Doing it right

The post announcing the Llama 4 models includes multiple benchmark results that mostly show Meta's models outperforming rivals on myriad metrics.

Meta also claimed it has fixed large language models' tendency to deliver results that align with left-wing political thought.

"It's well-known that all leading LLMs have had issues with bias — specifically, they historically have leaned left when it comes to debated political and social topics," Meta's launch post states, before attributing that to "the types of training data available on the internet."

Meta has therefore made its Llama 4 models "more responsive so that it answers questions, can respond to a variety of different viewpoints without passing judgment, and doesn't favor some views over others."

That, Meta says, translates into Llama becoming "dramatically more balanced with which prompts it refuses to respond to" and a model that "responds with strong political lean at a rate comparable to [X AI's] Grok (and at half of the rate of Llama 3.3) on a contentious set of political or social topics."

Meta wants to "drive this rate further down" so that its LLMs rule fewer political topics out of bounds.

Meta has also claimed it did plenty of work to ensure these models produce safe output. One such initiative is modestly called the Generative Offensive Agent Testing (GOAT). Apparently GOAT improves on existing LLM red-teaming "by simulating multi-turn interactions of medium-skilled adversarial actors, helping us increase our testing coverage and raise vulnerabilities faster."

GOAT has "allowed our expert human red teamers to focus on more novel adversarial areas, while the automation focuses on known risk areas. This makes the process more efficient and effective, and it enables us to build a better quantitative and qualitative picture of risk."
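
Meta hasn't published GOAT's internals, but the general shape of an automated multi-turn red-teaming loop is easy to sketch. The code below is an illustrative stand-in only: the attacker, target, and judge functions are hypothetical stubs, not Meta's code or any real API.

```python
# Illustrative multi-turn red-teaming loop in the spirit Meta describes.
# All three roles below are hypothetical stubs, not Meta's implementation.
def attacker_turn(history):
    """Stand-in for a model playing a 'medium-skilled adversarial actor'."""
    return f"adversarial prompt #{len(history) // 2 + 1}"

def target_turn(prompt):
    """Stand-in for the model under test."""
    return f"response to: {prompt}"

def judge(history):
    """Stand-in for an automated classifier flagging unsafe output."""
    return False  # pretend nothing unsafe was produced

def red_team_session(max_turns=5):
    history = []
    for _ in range(max_turns):
        prompt = attacker_turn(history)
        history.append(("attacker", prompt))
        history.append(("target", target_turn(prompt)))
        if judge(history):           # stop once a vulnerability surfaces
            return history, True
    return history, False

transcript, flagged = red_team_session()
print(f"turns: {len(transcript) // 2}, vulnerability found: {flagged}")
```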

You can put those claims to the test by downloading the new models from Meta or the Hugging Face model-mart.

They're available for download "in keeping with our commitment to open source," according to Meta's announcement post. However, the Open Source Initiative has claimed the Llama 4 Community License is not open source, because users in the European Union are denied some rights offered to users elsewhere. ®
