Google trains a GenAI model to simulate Doom's game engine in real-ish time

The proof of concept shows promise despite big limitations

A team from Google and Tel Aviv University have developed a generative AI game engine capable of simulating the cult classic Doom at more than 20 frames per second because research.

The work, detailed in a paper published [PDF] yesterday, demonstrates how reinforcement learning and diffusion models can be used to simulate game engines in real time.

Dubbed GameNGen, pronounced "game engine," the model was trained on Doom, but the researchers note that nothing about the approach is specific to that game; it could be applied to any number of titles.

Traditional game engines are coded manually to follow a set loop that tracks user inputs, updates the game state, and renders pixels on the screen. Do this fast enough and it creates the illusion that you're moving through and interacting with a virtual environment.
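That classic loop can be sketched in a few lines of Python. To be clear, this is an illustrative toy, not Doom's actual code; the function names and the 35 FPS cap are stand-ins:

```python
import time

TARGET_FPS = 35  # the original Doom's frame cap

def game_loop(get_input, update_state, render, state):
    """Classic fixed-step game loop: poll input, advance the state, draw."""
    frame_time = 1.0 / TARGET_FPS
    while not state["quit"]:
        start = time.monotonic()
        actions = get_input()                # 1. track user inputs
        state = update_state(state, actions) # 2. update the game state
        render(state)                        # 3. render pixels on screen
        # Sleep off whatever is left of this frame's time budget
        elapsed = time.monotonic() - start
        if elapsed < frame_time:
            time.sleep(frame_time - elapsed)
    return state
```

Run fast enough, the render calls blur into continuous motion, which is the illusion the article describes.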

YouTube video

GameNGen works differently: every frame is generated on the fly by the model, conditioned on the player's actions and the past few frames. The levels are imagined, or recalled, by the model.
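In loop form, a generative engine of this kind looks roughly like the sketch below. The `predict_next` method is a hypothetical stand-in for the diffusion model, and the 64-frame context window is an assumption chosen to match the roughly three seconds of memory described later, not a confirmed GameNGen parameter:

```python
from collections import deque

CONTEXT_LEN = 64  # assumed history window: ~3 seconds at ~20 FPS

def neural_game_loop(model, get_action, show, steps):
    """Autoregressive 'engine': each frame is predicted from recent history.

    No hand-coded game state exists; the only memory is a short buffer
    of past frames and actions fed back into the model.
    """
    frames = deque(maxlen=CONTEXT_LEN)
    actions = deque(maxlen=CONTEXT_LEN)
    frame = model.initial_frame()
    for _ in range(steps):
        show(frame)
        frames.append(frame)
        actions.append(get_action())
        # The model generates the next frame conditioned on recent history;
        # anything older than the buffer is simply forgotten.
        frame = model.predict_next(list(frames), list(actions))
    return frame
```

The fixed-length `deque` is the point: once a frame falls out of the buffer, the "engine" has no record it ever happened.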

To do this, you might assume the researchers mined hours of footage from real players, but according to the team, that approach wasn't practical.

Instead, the first phase of GameNGen's training was to create a reinforcement learning agent that learned to play Doom. The data generated by these training sessions was used to train a custom diffusion model based on Stable Diffusion v1.4, which renders the game.
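That first phase amounts to recording the agent's play sessions as (frame, action, next frame) examples for the diffusion model to learn from. A minimal sketch, assuming conventional agent/environment interfaces (`act`, `reset`, `step` are illustrative names, not the paper's code):

```python
def collect_training_data(agent, env, episodes):
    """Phase 1: an RL agent plays the game; its trajectories become
    the training set for the frame-predicting diffusion model."""
    dataset = []
    for _ in range(episodes):
        obs = env.reset()
        done = False
        while not done:
            action = agent.act(obs)
            next_obs, reward, done = env.step(action)
            # Each transition is one supervised example:
            # given this frame and action, predict the next frame.
            dataset.append((obs, action, next_obs))
            obs = next_obs
    return dataset
```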

According to the researchers, running on a single TPU v5, GameNGen was able to achieve around 20 FPS. While that's far from the 60-plus FPS target considered acceptable for most modern first-person shooters, it's worth noting that the OG Doom maxed out at 35 FPS anyway.

The researchers found that faster performance, up to 50 FPS, was possible by dropping down to a single denoising step, though image quality suffered as a result.

In terms of visual quality, the boffins claim the generated frames are comparable to lossy JPEG compression, and that "human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation." We've embedded the video for you to judge for yourselves, but it's worth noting that those "short clips" only amounted to 1.6 to 3.2 seconds of gameplay.

As you might expect, GameNGen is really a proof of concept at this point and suffers from numerous limitations, as highlighted in the paper. Among the biggest is memory: running on a single TPU v5, the model only has enough room to store about three seconds of gameplay.

Anything longer than that and it will forget what happened, and how the imagined level is laid out, as if it were progressing through a constantly shifting dream.

The fact alone that game logic can function at all despite this limitation is "remarkable" in the words of the researchers.

Another limitation highlighted in the paper is that relying on reinforcement learning agents as a source of training data means not every corner of the original game was mapped. "Our agent, even at the end of training, still does not explore all the game locations and interactions, leading to erroneous behavior in those cases." ®
