How DeepMind's AlphaGo Zero learned all by itself to trash world champ AI AlphaGo
Self-play code excites machine-learning world
Analysis DeepMind published a paper today describing AlphaGo Zero – a leaner and meaner version of AlphaGo, the artificially intelligent program that crushed professional Go players.
Go was considered a difficult game for computers to master because, besides being complex, the number of possible board positions – about 10^170, far more than chess – is greater than the number of atoms in the observable universe.
But AlphaGo, the predecessor to AlphaGo Zero, crushed 18-time world champion Lee Sedol and the reigning world number one player, Ke Jie. After beating Jie earlier this year, DeepMind announced AlphaGo was retiring from future competitions.
Now an even stronger competitor is in town. AlphaGo Zero has, we're told, beaten AlphaGo 100-0 after training for just a fraction of the time AlphaGo needed, and it didn't learn from observing humans playing against each other – unlike AlphaGo. Instead, Zero's neural network relies on an old technique in reinforcement learning: self-play.
Essentially, AlphaGo Zero plays against itself. During training, it sits on each side of the table: two instances of the same software face off against each other. A match starts with black and white stones placed on the board by a random sequence of opening moves. The two computer players are given the list of moves that led to the positions of the stones on the grid, and then are each told to come up with multiple chains of next moves along with estimates of the probability they will win by following through each chain.
So, the black player could come up with four chains of next moves, and predict the third chain will be the most successful. The white player could come up with its own chains, and think its first choice is the strongest.
The next move from the best possible chain is then played, and the computer players repeat the above steps, coming up with chains of moves ranked by strength. This repeats over and over, with the software feeling its way through the game and internalizing which strategies turn out to be the strongest.
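The loop described above can be sketched in a few lines of Python. This is a toy illustration, not DeepMind's code: `propose_chains` simply samples random move sequences and attaches made-up win probabilities, standing in for the estimates the real system gets from its neural network.

```python
import random

def propose_chains(state, num_chains=4, depth=3):
    """Hypothetical stand-in for the network: sample candidate move
    sequences ("chains") and attach an illustrative win-probability
    estimate to each."""
    chains = []
    for _ in range(num_chains):
        moves = [random.choice(state["legal_moves"]) for _ in range(depth)]
        chains.append({"moves": moves, "win_prob": random.random()})
    return chains

def self_play_step(state):
    """One turn of the self-play loop: the player to move proposes
    several chains, ranks them by estimated win probability, and plays
    the first move of the strongest chain."""
    chains = propose_chains(state)
    best = max(chains, key=lambda c: c["win_prob"])
    return best["moves"][0]

# Toy position with a handful of legal moves, named in Go coordinates.
state = {"legal_moves": ["a1", "b2", "c3", "d4"]}
move = self_play_step(state)
```

Both sides run this same step in alternation; repeated over millions of games, the win/loss outcomes are what actually train the network's probability estimates.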
The old AlphaGo relied on a computationally intensive Monte Carlo tree search to play through Go scenarios. The nodes and branches created a much larger tree than AlphaGo practically needed to play. A combination of reinforcement learning and human-supervised learning was used to build "value" and "policy" neural networks that used the search tree to execute gameplay strategies. The software learned from 30 million moves played in human-on-human games, and benefited from various bodges and tricks to learn to win. For instance, it was trained from master-level human players, rather than picking it up from scratch.
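Inside that tree search, the two networks pull in opposite directions: the value network favours moves that have looked good so far, while the policy network's prior encourages exploring promising but under-visited moves. The AlphaGo papers express this with a PUCT selection rule; here is a rough sketch, where the node dictionaries and the constant `c_puct` are illustrative:

```python
import math

def puct_score(value_sum, visits, prior, parent_visits, c_puct=1.5):
    """PUCT rule: average value so far (exploitation) plus the policy
    prior scaled down by visit count (exploration)."""
    q = value_sum / visits if visits else 0.0
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + visits)
    return q + u

def select_child(children, parent_visits):
    """Pick the child node with the highest PUCT score, as each step
    of the tree search would."""
    return max(children, key=lambda c: puct_score(
        c["value_sum"], c["visits"], c["prior"], parent_visits))

# Toy nodes: an unvisited move with a strong policy prior can outrank
# a well-explored move with a mediocre average value.
children = [
    {"move": "d4",  "value_sum": 5.0, "visits": 10, "prior": 0.2},
    {"move": "q16", "value_sum": 0.0, "visits": 0,  "prior": 0.7},
]
best = select_child(children, parent_visits=10)  # picks "q16"
```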
AlphaGo Zero did start from scratch with no experts guiding it. And it is much more efficient: it only uses a single computer and four of Google's custom TPU chips to play matches, compared to AlphaGo's several machines and 48 TPUs. Since Zero didn't rely on human gameplay and played far fewer matches, its Monte Carlo tree search is smaller. The self-play algorithm also combined the value and policy neural networks into one, and was trained on 64 GPUs and 19 CPUs over a few days by playing nearly five million games against itself. In comparison, AlphaGo needed months of training and used 1,920 CPUs and 280 GPUs to beat Lee Sedol.
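Merging the two networks means one shared trunk feeding two output heads: a probability distribution over moves (policy) and a scalar win estimate (value). A minimal NumPy sketch of the idea – a single linear trunk with random illustrative weights standing in for the paper's deep residual network:

```python
import numpy as np

rng = np.random.default_rng(0)
BOARD = 19 * 19  # 361 intersections on a Go board

# Illustrative weights: a shared trunk plus two separate heads.
W_trunk = rng.standard_normal((BOARD, 64)) * 0.01
W_policy = rng.standard_normal((64, BOARD)) * 0.01
W_value = rng.standard_normal((64, 1)) * 0.01

def dual_head(board_vec):
    """One shared representation feeds both heads: a softmax move
    distribution (policy) and a tanh-squashed win estimate in
    [-1, 1] (value)."""
    h = np.tanh(board_vec @ W_trunk)        # shared trunk
    logits = h @ W_policy
    policy = np.exp(logits - logits.max())
    policy /= policy.sum()                  # softmax over moves
    value = float(np.tanh(h @ W_value))     # scalar win estimate
    return policy, value

policy, value = dual_head(rng.standard_normal(BOARD))
```

Sharing the trunk means one forward pass serves both the move prior and the position evaluation that the tree search needs, which is part of why Zero gets away with so much less hardware.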
Through self-play, AlphaGo Zero even discovered for itself, without human intervention, classic elements of Go theory, such as fuseki opening tactics and the concept of life and death. More details can be found in Nature, or from the paper directly here. Stanford computer science academic Bharath Ramsundar has a summary of the more technical points, here.
Self-play is fun
Self-play is an established technique in reinforcement learning, and has been used to teach machines to play backgammon, chess, poker, and Scrabble. David Silver, a lead researcher on AlphaGo, said it’s an effective technique because the opponent is always the right level of difficulty.
“So it starts off extremely naive," he said. "But at every step of the learning process it has an opponent – a sparring partner if you like – that is exactly calibrated to its current level of performance. To begin with these players are very weak but over time they get progressively stronger.”
Tim Salimans, a research scientist at OpenAI, explained to The Register that self-play means “agents can learn behaviours that are not hand coded on any reinforcement learning task, but the sophistication of the learned behavior is limited by the sophistication of the environment. In order for an agent to learn intelligent behavior in a particular environment, the environment has to be challenging, but not too challenging.
“The competitive element makes the agent explicitly search for its own weaknesses. Once those weaknesses are found the agent can improve them. In self-play the difficulty of the task the agent is solving is always reasonable, but over time it is open ended: since the opponent can always improve, the task can always get harder."
Mastering Go is all fine and dandy, but what else is self-play good for? Well, not much beyond board games, at the moment. There are problems AlphaGo Zero cannot solve, such as games with hidden states or imperfect information – StarCraft, for example – and it's unlikely that self-play will be successful in tackling more advanced challenges.
The self-play approach will be worthwhile in some areas of AI, however. Salimans said: “As our algorithms for reinforcement learning become more powerful the bottleneck in developing artificial intelligence will gradually shift to developing sufficiently sophisticated tasks and environments. Even very talented people will not develop a great intellect if they are not exposed to the right environment.”
DeepMind believes that the approach may be generalizable to a wider set of scenarios that share similar properties to a game like Go. It could prove useful in planning tasks or problems where a series of actions have to be taken in the correct sequence such as protein folding, reducing energy consumption, or searching for new materials. ®