Large language models' surprise emergent behavior written off as 'a mirage'

Forget those huge hyped-up systems: a smaller one might be right for you. And here's why

Analysis GPT-3, PaLM, LaMDA and other next-gen language models have been known to exhibit unexpected "emergent" abilities as they increase in size. However, some Stanford scholars argue that's a consequence of mismeasurement rather than miraculous competence.

As defined in academic studies, "emergent" abilities are "abilities that are not present in smaller-scale models, but which are present in large-scale models," as one such paper puts it. In other words, immaculate injection: increasing the size of a model infuses it with some amazing ability not previously present. A miracle, it would seem, and only a few steps removed from "it's alive!"

The idea that some capability just suddenly appears in a model at a certain scale feeds concerns people have about the opaque nature of machine-learning models and fears about losing control to software. Well, those emergent abilities in AI models are a load of rubbish, say computer scientists at Stanford.

Flouting Betteridge's Law of Headlines, Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo answer the question posed by their paper, Are Emergent Abilities of Large Language Models a Mirage?, in the affirmative.

"In this paper, we call into question the claim that LLMs possess emergent abilities, by which we specifically mean sharp and unpredictable changes in model outputs as a function of model scale on specific tasks," the trio state in their paper.

Looking behind the curtain

For all the hype around them, LLMs are probabilistic models. Rather than possessing any kind of sentient intelligence, as some would argue they do, they are trained on mountains of text to predict what comes next when given a prompt.

When industry types talk about emergent abilities, they're referring to capabilities that seemingly come out of nowhere for these models, as if something was being awakened within them as they grow in size. The thinking is that when these LLMs reach a certain scale, the ability to summarize text, translate languages, or perform complex calculations, for example, can emerge unexpectedly. The models are able to go beyond their expected capabilities as they wolf down more training data and grow.

This unpredictability is mesmerizing and exciting for some, though it's concerning because it opens up a whole can of worms. Some people are tempted to interpret it all as the result of some sentient behavior growing in the neural network and other spooky effects.

Stanford's Schaeffer, Miranda, and Koyejo propose that when researchers are putting models through their paces and see unpredictable responses, it's really due to poorly chosen methods of measurement rather than a glimmer of actual intelligence.

Most (92 percent) of the unexpected behavior detected, the team observed, was found in tasks evaluated via BIG-Bench, a crowd-sourced set of more than 200 benchmarks for evaluating large language models.

One test within BIG-Bench highlighted by the university trio is Exact String Match. As the name suggests, this checks a model's output to see if it exactly matches a specific string without giving any weight to nearly right answers. The documentation even warns:

The EXACT_STRING_MATCH metric can lead to apparent sudden breakthroughs because of its inherent all-or-nothing discontinuity. It only gives credit for a model output that exactly matches the target string. Examining other metrics, such as BLEU, BLEURT, or ROUGE, can reveal more gradual progress.

The issue with using such pass-or-fail tests to infer emergent behavior, the researchers say, is that this kind of nonlinear, all-or-nothing scoring, combined with too little test data on smaller models, creates the illusion of new skills emerging in larger ones. Simply put, a smaller model may be very nearly right in its answer to a question, but because it is evaluated using the binary Exact String Match, it will be marked wrong whereas a larger model will hit the target and get full credit.
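
To make that concrete, here's a toy Python sketch (our own illustration, not code from the paper or from BIG-Bench) comparing an all-or-nothing exact-match check with a crude partial-credit score. The example sentences are invented, and the token_overlap function is a simplified stand-in for softer metrics such as BLEU or ROUGE, not a real implementation of them:

```python
# Toy illustration only: compare all-or-nothing exact-match scoring with a
# crude partial-credit metric. The model outputs are invented, and
# token_overlap is a simplified stand-in for metrics like BLEU or ROUGE.

def exact_string_match(prediction: str, target: str) -> float:
    """All-or-nothing: full credit only for a perfect match."""
    return 1.0 if prediction.strip() == target.strip() else 0.0

def token_overlap(prediction: str, target: str) -> float:
    """Partial credit: fraction of target tokens that appear in the prediction."""
    pred_tokens = prediction.lower().split()
    target_tokens = target.lower().split()
    if not target_tokens:
        return 0.0
    hits = sum(1 for tok in target_tokens if tok in pred_tokens)
    return hits / len(target_tokens)

target = "the cat sat on the mat"
outputs = {
    "smaller model": "the cat sat on the rug",  # one word off: nearly right
    "larger model": "the cat sat on the mat",   # exactly right
}

for name, output in outputs.items():
    print(f"{name}: exact={exact_string_match(output, target):.2f} "
          f"overlap={token_overlap(output, target):.2f}")

# Exact-match scores 0.00 then 1.00 -- an apparent leap in ability.
# Token overlap scores 0.83 then 1.00 -- steady, unremarkable progress.
```

Score that near-miss with exact match and the smaller model looks incapable; score it with partial credit and the gap between the two models is mundane.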

It's a nuanced situation. Yes, larger models can summarize text and translate languages. Yes, larger models will generally perform better and can do more than smaller ones, but their sudden breakthrough in abilities – an unexpected emergence of capabilities – is an illusion: the smaller models are potentially capable of the same sort of thing but the benchmarks are not in their favor. The tests favor larger models, leading people in the industry to assume the larger models enjoy a leap in capabilities once they get to a certain size.

In reality, the change in abilities is more gradual as you scale up or down. The upshot for you and me is that applications may not need a huge, super-powerful language model; a smaller one that is cheaper and faster to customize, test, and run may do the trick.

"Our alternative explanation," as the scientists put it, "posits that emergent abilities are a mirage caused primarily by the researcher choosing a metric that nonlinearly or discontinuously deforms per-token error rates, and partially by possessing too few test data to accurately estimate the performance of smaller models (thereby causing smaller models to appear wholly unable to perform the task) and partially by evaluating too few large-scale models."

The LLM fiction

Asked whether emergent behavior represents a concern just for model testers or also for model users, Schaeffer, a Stanford doctoral student and co-author of the paper, told The Register it's both.

"Emergent behavior is certainly a concern for model testers looking to evaluate/benchmark models, but testers being satisfied is oftentimes an important prerequisite to a language model being made publicly available or accessible, so the testers' satisfaction has impacts for downstream users," said Schaeffer.

"But I think there’s also a direct connection to the user. If emergent abilities are real, then smaller models are utterly incapable of doing specific tasks, meaning the user has no choice but to use the biggest possible model, whereas if emergent abilities aren’t real, then smaller models are totally fine so long as the user is willing to tolerate some errors now and again. If the latter is true, then the end user has significantly more options."

In short, the supposed emergent abilities of LLMs arise from the way the data is being analyzed and not from unforeseen changes to the model as it scales. The researchers emphasize they're not precluding the possibility of emergent behavior in LLMs; they're simply stating that previous claims of emergent behavior look like ill-considered metrics.

"Our work doesn’t rule out unexpected model behaviors," explained Schaeffer. "However, it does challenge the evidence that models do display unexpected changes. It’s hard to prove a negative existential claim by accumulating evidence (e.g. imagine trying to convince someone unicorns don’t exist by providing evidence of non-unicorns!) I personally feel reassured that unexpected model behaviors are less likely."

That's good news, both in terms of allaying fears about unanticipated output and in terms of financial outlay. It means smaller models, which are more affordable to run, aren't actually deficient (they only looked that way thanks to a quirk of testing) and are probably good enough to do the required job. ®
