Video An AI bot managed to exploit a hidden bug in popular 1980s arcade game Q*bert to rack up impossibly high scores after being programmed using evolution strategy algorithms.
Evolution strategies (ES) offer an alternative to the more traditional reinforcement learning (RL) methods that have been used to train machines to beat humans at Go or Chess.
A paper published last week by a trio of researchers from the University of Freiburg, Germany, showed that in some cases the performance for ES can exceed that achieved with RL.
Here’s a video of the agent in action. It appears to flit between the platforms, jumping on the cubes in no particular order, causing the colours to change rapidly. Normally, the game should progress to the second round, but a bug causes the platforms to keep blinking and the agent can continue to bounce around and gain huge scores - close to one million points.
Patryk Chrabaszcz and Frank Hutter, co-authors of the paper, told The Register that they don’t know where the bug comes from and why it exists in the game. But they explained that it wasn’t a complete fluke the bot managed to exploit it.
The boffins' neural network was fed about 1.5 million parameters of raw pixels from the game. The output determined by ES is actions the bot chooses to take in order to play the game.
The aim is to maximise the parameter’s values so that the policy network achieves high scores when it plays the game. In ES, the policy network has random parameters to begin with. It then spawns a new “offspring population” of policy networks, when a bit of noise is added to the previous policy network.
The reward achieved by each offspring is inspected as the game is played. The parameter values from the highest achievers, the ones that get the highest reward are combined to give a weighted mean. The best network gets the highest weight. This process is repeated thousands of times, and is something the researchers call “episodic reward”. It’s slower and takes more computation than updating the policy in traditional RL.
“To find the bug, the agent had to first learn to almost complete the first level - this was not done at once but using many small improvements. We suspect that at some point in the training one of the offspring solutions encountered the bug and got a much better score compared to its siblings, which in turn increased its contribution to the update - its weight was the highest one in the weighted mean. This slowly moved the solution into the space where more and more offsprings started to encounter the same bug."
“We do not know the precise conditions under which the bug appears; it is possible that it only appears if the agent follows a pattern that seems suboptimal, [for example when the agent wastes time, or even looses a life]. If that was the case, then it would be extremely hard for standard RL to find the bug: if you use incremental rewards you will learn strategies that quickly yield some reward, rather than learning strategies that don't yield much reward for a while and then suddenly win big.” said Chrabaszcz and Hutter.
Exploring through play
Since the reward signal is updated at a slower rate compared to RL, the bot isn’t immediately discouraged from exploring what looks like random bouncing allowing it to find the bug. Due to the random nature of the noise added to the offspring, the flaw was only discovered eight out of 30 times when the team ran their experiments.
But ES doesn’t always beat RL. The paper also shows the results that ES doesn’t perform so well on other Atari games such as racing game "Enduro" or submarine shooter "Seaquest".
The hype surrounding reinforcement learning has recently died down a little after a Google engineer discussed its failures. ES algorithms might not be able to boast such flashy, headline grabbing results as RL, but they’re easier to use, the researchers who played Q*bert said.
“[It’s] why ES algorithms have been used in tons of industrial optimization problems for the last decades; as the blog post mentions, we're still waiting for RL to get to that point. ES algorithms are just a few lines of code, and our canonical ES algorithm only has a few hyperparameters,” the duo said.
But they also warned that it wasn’t a “magic bullet”, ES requires much more training samples and takes up a lot more computational power than RL. The researchers used a whopping 400 CPUs for three training runs of its policy network.
Instead, they believe ES can be used together with RL to help improve agents explore different play methods that might lead to more effective behaviour. “How to combine ES and RL is an open research question,” Chrabaszcz and Hutter said. ®