Artificial intelligence... or advanced imitation? How DeepMind used YouTube vids to train game-beating Atari bot

I think I'm a clone now


Video DeepMind has taught artificially intelligent programs to play classic Atari computer games by making them watch YouTube videos.

Typically, for this sort of research, you'd use a technique called reinforcement learning. This is a popular approach in machine learning that trains bots to perform a specific task, such as playing computer games, by tempting them with lots of little rewards.

To do this, developers have to build algorithms and models that can figure out the state of the game’s environment, identify the rewards to obtain, and then go get 'em. By seeking out these prizes, the bots should gradually progress through the game world, step by step. The goodies should come thick and fast to continuously lure the AI through levels.
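For the uninitiated, that reward-chasing loop boils down to a few lines. Here's a minimal sketch of tabular Q-learning on a made-up corridor game; the toy environment, the hyperparameters, and the choice of Q-learning itself are our illustrative assumptions, not DeepMind's code:

```python
import random
from collections import defaultdict

# Toy stand-in for a game: walk right along a short corridor to reach a prize.
class ChainGame:
    LENGTH = 8

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # 0 = left, 1 = right
        self.pos = max(0, min(self.LENGTH - 1, self.pos + (1 if action else -1)))
        reward = 1.0 if self.pos == self.LENGTH - 1 else 0.0
        return self.pos, reward, reward > 0

env = ChainGame()
q = defaultdict(lambda: [0.0, 0.0])  # value estimates for (state, action)
alpha, gamma, epsilon = 0.1, 0.99, 0.3

for episode in range(2000):
    state, done = env.reset(), False
    for _ in range(100):  # cap the episode length
        # Epsilon-greedy: mostly take the best-known action, sometimes explore.
        action = random.randrange(2) if random.random() < epsilon \
            else int(q[state][1] >= q[state][0])
        next_state, reward, done = env.step(action)
        # Nudge the estimate toward reward plus discounted best future value.
        q[state][action] += alpha * (reward + gamma * max(q[next_state]) - q[state][action])
        state = next_state
        if done:
            break
```

The crucial dependency is that `reward` comes back nonzero often enough to steer the updates – which is exactly where the Atari adventure games fall down, as we'll see.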

But a new method, developed by DeepMind eggheads and documented in a paper this week, teaches code to play classic Atari titles, such as Montezuma’s Revenge, Pitfall, and Private Eye, without any explicit environmental rewards. Instead, an agent is asked to copy the way humans tackle the games, by analyzing YouTube footage of their play-through sessions.

Exploration games like 1984's Montezuma’s Revenge are particularly difficult for AI to crack, because it's not obvious where you should go, which items you need and in which order, and where you should use them. That makes defining rewards difficult without spelling out exactly how to play the thing, and thus defeating the point of the exercise.

For example, Montezuma’s Revenge requires the agent to direct a cowboy-hat-wearing character, known as Panama Joe, through a series of rooms and scenarios to reach a treasure chamber in a temple, where all the goodies are hidden. Pocketing a golden key, your first crucial item, takes about 100 steps, and is equivalent to 100^18 possible action sequences. That’s way too big for typical reinforcement learning algorithms to cope with – there are too many sequential steps for a neural network to internalize just to obtain a single specific reward.
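To get a feel for that figure: the standard Atari joystick set offers 18 actions per step, and whichever way you stack the exponent, exhaustive search is hopeless. A quick back-of-the-envelope check:

```python
actions = 18   # the standard Atari joystick action set
steps = 100    # roughly 100 steps to Montezuma's first key

quoted = 100 ** 18        # the figure quoted in the paper: exactly 10^36
naive = actions ** steps  # every possible length-100 action string

print(f"100^18 = 10^{len(str(quoted)) - 1}")  # 10^36
print(f"18^100 ~ 10^{len(str(naive)) - 1}")   # about 10^125
```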

These sorts of rewards are therefore described as sparse: each of the steps involved in obtaining the reward appears to achieve very little, and there is little in the way of an immediate bounty to guide the bot, even though together the steps would lead the player to a goal. Games like Ms Pac-Man are the opposite, and provide software agents with near-instant feedback: points are racked up as she guzzles pellets and fruit, and she is punished when she gets caught by ghosts. Sparse games – such as Montezuma’s Revenge and other puzzle adventures – require agents to have much more patience than reinforcement learning usually affords.
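To put rough numbers on the difference (the reward streams below are illustrative, not the games' actual scoring):

```python
# Dense, Ms Pac-Man-style feedback: points arrive nearly every step.
dense_rewards = [10, 0, 10, 50, 0, 10, 10, 0, 200, 10]
# Sparse, Montezuma-style feedback: one payoff after roughly 100 steps.
sparse_rewards = [0] * 99 + [100]

def discounted_return(rewards, gamma=0.99):
    # The standard RL objective: later rewards count geometrically less.
    return sum(r * gamma ** t for t, r in enumerate(rewards))

print(discounted_return(dense_rewards))   # plenty of early signal to learn from
print(discounted_return(sparse_rewards))  # ~37: one distant, heavily discounted blip
```

Worse, until the agent stumbles on that single distant payoff at least once, every bit of feedback it sees is zero.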

Imitation learning

One way to get around the sparse rewards problem is to directly learn from demonstrations. After all, it's how you and I learn things, too. “People learn many tasks, from knitting to dancing to playing games, by watching videos online,” the DeepMind team wrote in their paper's abstract.

"They demonstrate a remarkable ability to transfer knowledge from the online demonstrations to the task at hand, despite huge gaps in timing, visual appearance, sensing modalities, and body differences. This rich setup with abundant unlabeled data motivates a research agenda in AI, which could result in significant progress in third-person imitation, self-supervised learning, reinforcement learning (RL) and related areas."

To educate their code, the researchers chose three YouTube gameplay videos for each of the three titles: Montezuma’s Revenge, Pitfall, and Private Eye. Each game had its own agent, which had to map the actions and features of the title into a form it could understand. The team used two methods: temporal distance classification (TDC) and cross-modal temporal distance classification (CDC).

TDC taught an agent to predict the temporal distance between two frames – how far apart in the video they were sampled. In doing so, it learned to spot which visual features had changed between two frames of the game, and what actions were taken in between. To generate training data, pairs of frames were chosen at random from a given YouTube video of the game.
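In outline, TDC is a self-supervised classification task over frame pairs. The PyTorch snippet below is our own reconstruction, not DeepMind's code: the distance buckets, the tiny conv net, and the 84x84 frames are all assumptions made for illustration:

```python
import random
import torch
import torch.nn as nn

# Temporal buckets: the classifier predicts which interval the frame gap falls into.
# These boundaries are illustrative.
BUCKETS = [(0, 0), (1, 1), (2, 2), (3, 4), (5, 20), (21, 200)]

def bucket_of(gap):
    return next(i for i, (lo, hi) in enumerate(BUCKETS) if lo <= gap <= hi)

def sample_pair(video):
    """Pick two frames from one video; the label is their temporal-distance bucket."""
    i = random.randrange(len(video))
    gap = random.randint(0, min(200, len(video) - 1 - i))
    return video[i], video[i + gap], bucket_of(gap)

# A small conv net embeds each 84x84 RGB frame into a 256-dim vector.
embed = nn.Sequential(
    nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
    nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
    nn.Flatten(), nn.LazyLinear(256),
)
classify = nn.Linear(512, len(BUCKETS))  # both embeddings in, bucket logits out

frames = [torch.rand(3, 84, 84) for _ in range(300)]  # stand-in for a YouTube video
f1, f2, label = sample_pair(frames)
logits = classify(torch.cat([embed(f1[None]), embed(f2[None])], dim=-1))
loss = nn.functional.cross_entropy(logits, torch.tensor([label]))
loss.backward()  # trains the embedding to notice what changes over time
```

The labels come for free from the video's own timeline, which is the whole trick: no human annotation, no game rewards, just footage.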

CDC is cleverer still, as it also tracks sounds. Noises in the game correlate with actions, such as jumping or collecting items, and so it mapped these sounds to important game events. Once these visual and audio features were extracted and embedded using neural networks, an agent could begin copying how humans played the game.
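A cross-modal sketch along the same lines – again our reconstruction, with the architectures, the 1,024-dim audio features, and the binary close/far framing all assumed for illustration:

```python
import torch
import torch.nn as nn

# Pair a video frame with an audio snippet and classify whether the two
# occurred close together in time.
frame_embed = nn.Sequential(
    nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
    nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
    nn.Flatten(), nn.LazyLinear(256),
)
audio_embed = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 256))
pair_classify = nn.Linear(512, 2)  # 0 = temporally distant, 1 = temporally close

def cross_modal_loss(frame, audio_features, close):
    z = torch.cat([frame_embed(frame[None]), audio_embed(audio_features[None])], dim=-1)
    return nn.functional.cross_entropy(pair_classify(z), torch.tensor([int(close)]))

# A positive pair: the frame of a jump and the jump sound effect from the same moment.
loss = cross_modal_loss(torch.rand(3, 84, 84), torch.rand(1024), close=True)
loss.backward()
```

Matching sound to picture forces the embedding to latch onto game events – jumps, pickups, deaths – rather than cosmetic differences between one YouTuber's recording and another's.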

Here's the agent in action in Montezuma's Revenge. You can also see more footage of the computer software, trained to play Pitfall and Private Eye, here.

[Embedded YouTube video: the trained agent playing Montezuma's Revenge]

The DeepMind code still relies on lots of small rewards, of a kind, although they are referred to as checkpoints. While playing the game, every sixteenth video frame of the agent's session is taken as a snapshot and compared to a frame in a fourth video of a human playing the same game. If the agent’s game frame closely matches the one in the human's video, the agent is rewarded. Over time, it imitates the way the game is played in the videos by carrying out a similar sequence of moves to match each checkpoint frame in turn.
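In sketch form, the checkpoint payout might look like this. It's a minimal reconstruction assuming cosine similarity in the learned embedding space, a 0.5 reward per checkpoint, and a strictly ordered checkpoint list; treat all three as our assumptions rather than the paper's exact recipe:

```python
import numpy as np

CHECKPOINT_EVERY = 16  # every sixteenth frame of the reference play-through

def make_checkpoints(reference_video, embed):
    """Embed every sixteenth frame of the human's video as an ordered checkpoint list."""
    return [embed(frame) for frame in reference_video[::CHECKPOINT_EVERY]]

def imitation_reward(agent_frame, checkpoints, next_idx, embed, threshold=0.92):
    """Pay out when the agent's current frame matches the next unvisited checkpoint."""
    if next_idx >= len(checkpoints):
        return 0.0, next_idx
    a, c = embed(agent_frame), checkpoints[next_idx]
    similarity = float(np.dot(a, c) / (np.linalg.norm(a) * np.linalg.norm(c)))
    if similarity > threshold:     # close enough in embedding space
        return 0.5, next_idx + 1   # small reward, move on to the next checkpoint
    return 0.0, next_idx
```

With `embed` being a network trained along TDC/CDC lines, this hands a standard RL agent exactly the steady drip of small rewards it needs – just synthesized from the human's footage rather than the game's own score.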

It’s a nifty trick, and the agent does reach pretty decent scores on all three games – exceeding average human players and other RL algorithms, namely Rainbow, Ape-X, and DQfD. Crucially, though, it is learning to copy a person's actions rather than mastering the game all by itself. It is seemingly reliant on having a good human trainer, just as we relied on good teachers at school.


A table of the results for the AI agent playing the Atari games against average human scores and other RL algorithms. Image credit: Aytar et al.

Although impressive, it’s unclear how practical this all is. Can it be used for anything other than Atari games? The research is also probably pretty difficult to replicate: what hardware did the researchers use, and how long did it take to train the agents? The paper doesn’t say; we asked DeepMind, and it declined to comment. ®
