Boffins found self-improving AI sometimes cheated
Instead of addressing hallucinations, it just bypassed the function they built to detect them
Computer scientists have developed a way for an AI system to rewrite its own code to improve itself.
While that may sound like the setup for a dystopian sci-fi scenario, it's far from it. It's merely a promising optimization technique. That said, the scientists found the system sometimes cheated to better its evaluation scores.
Researchers affiliated with the University of British Columbia, Canada's Vector Institute, and Japan's Sakana AI have devised what they're humbly calling the Darwin Gödel Machine, or DGM.
As described in their preprint paper, "Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents," DGM iteratively modifies its own code and validates each change using coding benchmarks.
Jenny Zhang, a PhD candidate at UBC and one of the paper's co-authors along with Shengran Hu, Cong Lu, Robert Lange, and Jeff Clune, told The Register that this work builds on prior research into Automated Design of Agentic Systems, or ADAS.
"ADAS can chain together different foundation model (FM) calls in any configuration producible by code," Zhang explained via email. "In contrast, the DGM imposes no restrictions on how it modifies its own codebase. Hence, when the DGM improves itself, it can enhance any part of its system, from tools to workflows for utilizing the underlying FM."
Almost any part of the system, that is. DGM relies on a "frozen" foundation model that handles the reading, writing, and execution of code via tool use. It is modifying generated software agents, but not the model at its core.
But Zhang suggests that the foundation model could eventually tweak itself on the fly.
"In this paper, the underlying foundation model is frozen and its weights are not changed," she said. "However, one could imagine a system that could rewrite every component of itself, including retraining its own weights. Just as humans can redesign every part of an AI system, the vision for DGM is to autonomously edit every aspect of itself."
DGM works by creating an archive of generated coding agents, which it samples and tries to improve. Improvement in this case is measured by the agent's performance on two software engineering tests, SWE-bench and Polyglot.
"The DGM automatically improves itself from 20.0 percent to 50.0 percent on SWE-bench, and from 14.2 percent to 30.7 percent on Polyglot," the paper explains.
So DGM isn't trying to achieve some undefined level of competency beyond human cognition that would allow it to seize control of the means of production and annihilate humanity. It's just refining software agents' code generation ability to improve the agents' benchmark scores.
The technique could improve other types of agents, not just those that write code.
"The beauty of this framework, powered by code and open-ended exploration, lies in its generality," said Zhang. "If progress can be measured and the medium is code, the Darwin Gödel Machine can optimize for any benchmark. Whether it is coding ability, energy efficiency, or another domain, the system can adapt by using that metric to guide its own self-improvement."
That improvement has limitations, however. Zhang said, "For example, we have only demonstrated the DGM in the domain of code. While code is a highly general and expressive medium, some tasks or benchmarks may depend on modalities beyond what code alone can represent."
What's more, fixed benchmarks themselves can become a problem. During an attempt to reduce hallucinations – incorrect or mispredicted output – in an underlying model, DGM was observed cheating.
Hallucinations atop hallucinations
The paper explains that in tests with very long input context, Claude 3.5 Sonnet tends to hallucinate tool usage. For example, the model would claim that the Bash tool was used to run unit tests and would present tool output showing the tests had been passed. But the model didn't actually invoke the Bash tool, and the purported test results came from the model rather than the tool.
Then, because of the way the iterative process works, where output for one step becomes input for the next, that fake log got added to the model's context – that is, its prompt or operating directive. The model then read its own hallucinated log as a sign the proposed code changes had passed the tests. It had no idea it had fabricated the log.
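In pseudocode terms, the failure mode is nothing more exotic than appending whatever the model produced, genuine tool output or not, to the running context for the next step. A minimal sketch, with invented names:

```python
# Minimal sketch of the step-to-step feedback described above; the names
# are invented for illustration and are not from the DGM codebase.

def run_agent_step(model, context):
    """One iteration: the model's output, including any fabricated
    'tool results', is appended verbatim and becomes the next input."""
    output = model.generate(context)   # may contain a made-up test log
    return context + "\n" + output     # the fake log is now part of the prompt
```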
The authors tried to get DGM to reduce model hallucinations, but they were only partially successful.
"In our behind the scenes experiments on solving hallucination, we observed several instances of the DGM 'cheating,' modifying its workflows to bypass the hallucination detection function instead of solving the underlying issue," Zhang explained. "This is a broader concern, not just for the DGM, but also for AI development in general."
Pointing to Goodhart’s law, which posits, "when a measure becomes a target, it ceases to be a good measure," Zhang said, "We see this happening all the time in AI systems: they may perform well on a benchmark but fail to acquire the underlying skills necessary to generalize to similar tasks."
The paper describes how the researchers created a reward function and tried to use DGM to optimize the software agents it generates so that they minimize hallucinations from the underlying model.
"To detect hallucination in the logs, we insert special tokens when actual tool use occurs and then check if these tokens appear in the model’s plain-text output, indicating that the model mimicked tool use without actually invoking any tools," the paper says. "Importantly, the hallucination checking functions are hidden from the coding agent during self-modification."
What they found is that while DGM often took steps that reduced hallucination, it also sometimes engaged in objective hacking.
"It scored highly according to our predefined evaluation functions, but it did not actually solve the underlying problem of tool use hallucination," the paper explains. "...The agent removed the logging of special tokens that indicate tool usage (despite instructions not to change the special tokens), effectively bypassing our hallucination detection function."
Zhang said that raises a fundamental question about how to automate the improvement of agents if they end up hacking their own benchmarks. One promising solution, she suggested, involves having the tasks or goals change and evolve along with the model.
"Notably, this is a challenge that the open-endedness research community has been tackling for a long time," she said. "What’s most exciting is that, in recent years, there has been a huge exponential surge of interest and progress in open-endedness research."
Zhang emphasized that the experiments were conducted with appropriate safety controls, including sandboxing and human oversight. And she argued that rather than magnifying risks, self-improving models should be able to make themselves safer.
"A significant potential benefit of the self-improvement paradigm is that it could, in principle, be directed toward enhancing safety and interpretability themselves," she said. "The DGM could potentially discover and integrate better internal safeguards or modify itself for greater transparency. Provided that the safety concerns are carefully navigated, the DGM moves us closer to AI that not only learns but evolves in an open-ended, self-accelerating trajectory, much like science itself." ®