GPT-4 won't run Doom but will play the game poorly
Should we worry this model is happy to grab a gun and start blasting?
You may find yourself living in a shotgun shack. And you may find yourself working with GPT-4. And you may ask yourself, "Will GPT-4 run Doom?" And you may ask yourself, "Am I right? Am I wrong?"
Adrian de Wynter, a principal applied scientist at Microsoft and a researcher at the University of York in England, posed these questions in a recent research paper, "Will GPT-4 Run Doom?"
Alas, GPT-4, a large language model from Microsoft-backed OpenAI, lacks the capacity to execute Doom's source code directly.
But its multimodal variant, GPT-4V, which can accept images as input as well as text, exhibits the same endearing sub-competence playing Doom as the fraught text-based models that have launched countless AI startups.
"Under the paper's setup, GPT-4 (and GPT-4 with vision, or GPT-4V) cannot really run Doom by itself, because it is limited by its input size (and, obviously, that it probably will just make stuff up; you really don't want your compiler hallucinating every five minutes)," wrote de Wynter in an explanatory note about his paper. "That said, it can definitely act as a proxy for the engine, not unlike other 'will it run Doom?' implementations, such as E. Coli or Notepad."
That is to say, GPT-4V won't run Doom the way a John Deere tractor can, but it will play Doom without specific training.
To manage this, de Wynter designed a Vision component that captures screenshots from the game engine and calls GPT-4V to return structured descriptions of the game state. He combined that with an Agent model that calls GPT-4 to make decisions based on the visual input and previous history, and instructed it to translate its responses into keystroke commands that have meaning to the game engine.
Interactions are handled through a Manager layer consisting of an open source Python binding to the C Doom engine, with frames rendered via Matplotlib.
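For the curious, here is a rough sketch of what such a Vision-Agent-Manager loop might look like in Python. The `game` object, its `get_frame` and `press_key` methods, the prompts, and the single-key action format are all hypothetical stand-ins, not the paper's actual code; only the OpenAI chat-completions calls follow a real API shape.

```python
# Sketch of a GPT-4V "plays Doom" loop: Vision describes the frame,
# the Agent picks a keystroke, the Manager shuttles both to the engine.
import base64
from openai import OpenAI

client = OpenAI()

def describe_frame(png_bytes: bytes) -> str:
    """Vision component: ask GPT-4V for a structured description of the frame."""
    image_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe the Doom game state: enemies, doors, items, walls."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content

def choose_action(description: str, history: list[str]) -> str:
    """Agent component: ask GPT-4 for a single keystroke given state and history."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are playing Doom. Reply with exactly one key: "
                        "w, a, s, d (move), LEFT/RIGHT (turn), SPACE (use), CTRL (fire)."},
            {"role": "user",
             "content": "History:\n" + "\n".join(history[-10:])
                        + "\n\nCurrent state:\n" + description},
        ],
    )
    return resp.choices[0].message.content.strip()

def play(game, steps: int = 100) -> None:
    """Manager: shuttle frames and keystrokes between the engine and the models."""
    history: list[str] = []
    for _ in range(steps):
        description = describe_frame(game.get_frame())   # hypothetical engine call
        key = choose_action(description, history)
        game.press_key(key)                              # hypothetical engine call
        history.append(f"Saw: {description[:80]}... Pressed: {key}")
```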
This mix of AI models and code can open doors, fight foes, and fire weapons, according to the paper. It can also follow broader instructions, such as a level walkthrough, to improve its performance.
The main shortcoming of this GPT-4V-based system is its lack of object permanence – it forgets about in-game zombies when they go off-screen.
GPT-4 forgets about the zombie and just keeps going
"For example, it would be very common for the model to see a zombie on the screen, and start firing at it until it hit it (or died)," explains de Wynter. "Now, this is AI written to work with 1993 hardware, so I'm going to guess it doesn't have a super deep decision tree. So the zombie shoots at you and then starts running around the room.
"What's the issue here? Well, first that the zombie goes out of view. Worse, it is still alive and will whack you at some point. So you gotta go after it, right? After all, in Doom, it's whack or be whacked.
"It turns out that GPT-4 forgets about the zombie and just keeps going. Note: the prompt explicitly tells the model what to do if it is taking damage and it can't see an enemy. Better yet, it just goes off on its merry way, gets stuck in a corner, and dies. It did turn around a couple of times, but in nearly 50-60 runs, I observed it... twice, I wanna say."
- Husqvarna ports Doom to a robot lawnmower – not, thankfully, its chainsaws
- Doom is 30, and so is Windows NT. How far we haven't come
- Humans strike back at Go-playing AI systems
- Building a 16-bit CPU in a spreadsheet is Excel-lent engineering
Also, GPT-4 can't reason very well. Although its actions were generally correct in context, when asked to explain them it produced poor explanations that often included hallucinations (aka made-up information).
De Wynter nonetheless considers it remarkable that GPT-4 is capable of playing Doom without prior training.
At the same time, he finds that troubling.
"On the ethics department, it is quite worrisome how easy it was for (a) me to build code to get the model to shoot something; and (b) for the model to accurately shoot something without actually second-guessing the instructions," he wrote in his summary post.
"So, while this is a very interesting exploration around planning and reasoning, and could have applications in automated video game testing, it is quite obvious that this model is not aware of what it is doing. I strongly urge everyone to think about what deployment of these models [implies] for society and their potential misuse."
And you may say to yourself, "My God, what have I done?" ®