The State of AI: AlphaZero and Reinforcement Learning
In December 2017, DeepMind published a paper that shook the game-playing AI world. AlphaZero, starting from nothing but the rules of chess, shogi, and Go, taught itself to play at superhuman levels—in just 24 hours.
No opening books. No endgame tables. No human games to study. Just self-play and reinforcement learning.
This isn’t just about games. It’s a glimpse of AI’s potential when we let machines discover strategies rather than hand-coding them.
What AlphaZero Achieved
The numbers are staggering:
- Chess: After 9 hours of training, AlphaZero defeated Stockfish 8, then the world’s strongest chess engine, winning 28 games, drawing 72, and losing none in a 100-game match.
- Shogi: After 12 hours, it beat Elmo, the strongest shogi program.
- Go: After 13 days of training, it exceeded the level of AlphaGo Lee (the version that defeated world champion Lee Sedol).
Previous game-playing AIs relied heavily on human knowledge—opening databases, handcrafted evaluation functions, years of accumulated chess theory. AlphaZero threw all of that away and started from scratch.
How It Works
AlphaZero combines two powerful techniques: Monte Carlo Tree Search (MCTS) and deep neural networks.
The Neural Network
A single deep neural network takes the board position as input and outputs two things:
- Policy: Probability distribution over possible moves
- Value: Probability of winning from this position
The network learns entirely from self-play. No human games, no expert knowledge.
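To make the two-headed design concrete, here is a minimal sketch in plain Python: a tiny fully connected trunk feeding a softmax policy head and a tanh value head. The board size, hidden width, and move count are all invented for illustration, and the weights are random rather than trained.

```python
import math
import random

random.seed(0)
# Hypothetical toy sizes (all invented): a 3x3 board flattened to 9 inputs,
# a 16-unit hidden layer, and 9 candidate moves.
N_IN, N_HID, N_MOVES = 9, 16, 9

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

W_hid = rand_matrix(N_IN, N_HID)
W_pol = rand_matrix(N_HID, N_MOVES)
W_val = rand_matrix(N_HID, 1)

def matvec(vec, mat):
    return [sum(v * mat[i][j] for i, v in enumerate(vec)) for j in range(len(mat[0]))]

def forward(board):
    """One shared trunk, two heads: a move distribution and a scalar win estimate."""
    hidden = [math.tanh(x) for x in matvec(board, W_hid)]
    logits = matvec(hidden, W_pol)
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    policy = [e / total for e in exps]               # softmax over moves
    value = math.tanh(matvec(hidden, W_val)[0])      # win estimate in [-1, 1]
    return policy, value

policy, value = forward([random.choice([-1, 0, 1]) for _ in range(N_IN)])
```

The key design point survives even at this scale: both heads share one representation, so everything the network learns about a position serves both move selection and evaluation.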
Monte Carlo Tree Search
MCTS explores the game tree by simulating games. At each node:
- Use the policy network to guide which moves to explore
- Expand promising nodes
- Use the value network to evaluate positions
- Backpropagate results to update statistics
The combination is elegant: the neural network provides intuition, MCTS provides look-ahead search.
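The four steps above can be sketched with a toy implementation. This version plays a simple subtraction game (remove 1–3 stones; whoever takes the last stone wins) and substitutes uniform random rollouts for the value network, so it is classic UCT rather than AlphaZero’s network-guided PUCT; the game and all constants are invented for illustration.

```python
import math
import random

random.seed(0)
TAKE = (1, 2, 3)   # toy game: remove 1-3 stones; whoever takes the last stone wins

class Node:
    def __init__(self, stones):
        self.stones = stones    # stones left for the player to move
        self.children = {}      # move -> child Node
        self.visits = 0
        self.wins = 0.0         # wins for the player who moved INTO this node

def legal(stones):
    return [m for m in TAKE if m <= stones]

def rollout(stones):
    """Random playout standing in for the value network.
    Returns 1.0 if the player to move eventually wins."""
    mine = True
    while True:
        stones -= random.choice(legal(stones))
        if stones == 0:
            return 1.0 if mine else 0.0
        mine = not mine

def mcts(root_stones, iters=5000, c=1.4):
    root = Node(root_stones)
    for _ in range(iters):
        node, path = root, [root]
        # 1. Selection: descend fully expanded nodes by the UCT score
        while node.stones > 0 and len(node.children) == len(legal(node.stones)):
            node = max(node.children.values(),
                       key=lambda ch: ch.wins / ch.visits
                       + c * math.sqrt(math.log(node.visits) / ch.visits))
            path.append(node)
        # 2. Expansion: add one untried move
        if node.stones > 0:
            move = random.choice([m for m in legal(node.stones)
                                  if m not in node.children])
            child = Node(node.stones - move)
            node.children[move] = child
            path.append(child)
            node = child
        # 3. Simulation: estimate the value for the player to move here
        val = 0.0 if node.stones == 0 else rollout(node.stones)
        # 4. Backpropagation: flip the perspective at every level
        for n in reversed(path):
            val = 1.0 - val
            n.visits += 1
            n.wins += val
    # the most-visited root move is the recommendation
    return max(root.children, key=lambda m: root.children[m].visits)
```

With, say, 6 stones on the table, the search converges on taking 2, leaving the opponent the losing multiple-of-4 position. AlphaZero replaces the random rollout with a single value-network call and biases selection with the policy network, which is what makes the search tractable in games as deep as Go.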
Self-Play Training
The training loop is surprisingly simple:
- Play games against yourself using the current network
- Collect training data from those games
- Train the network on the collected data
- Repeat
Early games are essentially random. But as the network improves, the games become more sophisticated. The network learns to predict better moves and evaluate positions more accurately.
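Here is that loop in miniature, with a lookup table standing in for the neural network. The agent plays the same kind of small subtraction game against itself and nudges each visited position’s win estimate toward the game’s outcome; the game, learning rate, and exploration rate are all invented for illustration.

```python
import random

random.seed(1)
TAKE, START = (1, 2, 3), 10
# value[n]: estimated chance that the player to move wins with n stones left
value = {n: 0.5 for n in range(1, START + 1)}

def choose(stones, eps=0.2):
    """Epsilon-greedy: usually pick the move that leaves the opponent worst off."""
    moves = [m for m in TAKE if m <= stones]
    if random.random() < eps:
        return random.choice(moves)
    # taking the last stone wins immediately (scored -1.0 so it is always preferred)
    return min(moves, key=lambda m: value[stones - m] if stones - m else -1.0)

def self_play():
    """Play one game against yourself; return the visited states and the winner."""
    stones, mover, history = START, 0, []
    while True:
        history.append((stones, mover))
        stones -= choose(stones)
        if stones == 0:
            return history, mover   # whoever took the last stone wins
        mover ^= 1

# The training loop: play, collect, update, repeat
for _ in range(5000):
    history, winner = self_play()
    for stones, mover in history:
        target = 1.0 if mover == winner else 0.0
        value[stones] += 0.05 * (target - value[stones])  # nudge toward the outcome
```

After a few thousand games the table discovers the structure of the game on its own: positions with a multiple of 4 stones drift toward a low win estimate, everything else toward a high one, with no one having told it the rule.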
What Makes This Different
Previous approaches to game AI typically used:
- Handcrafted evaluation functions: Human experts define what makes a position good
- Opening books: Pre-computed best moves for opening positions
- Endgame tablebases: Exhaustive solutions for endgame positions
- Domain-specific optimizations: Alpha-beta pruning tuned over decades
AlphaZero uses none of these. Its evaluation function is learned, not designed. This has two major implications:
1. Generality: The same algorithm works for chess, shogi, and Go. Swap the game rules, retrain, done.
2. Novel strategies: AlphaZero’s play style differs from traditional computer chess. It plays more “human-like” in some ways—sacrificing material for positional advantages, preferring dynamic piece play over material counting.
Grandmasters analyzing its games found moves that contradicted decades of human chess theory.
Reinforcement Learning: The Bigger Picture
AlphaZero is a triumph of reinforcement learning (RL), a paradigm where agents learn from trial and error.
The core components:
- Agent: Makes decisions (AlphaZero)
- Environment: Provides feedback (the game)
- Reward: Signal for success (win/lose/draw)
- Policy: Strategy for choosing actions
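A minimal sketch of these four components, using a made-up one-step environment and a tiny preference table in place of a neural network:

```python
import random

random.seed(0)

class CoinEnv:
    """Made-up one-step environment: guess a biased coin.
    Side 1 comes up 70% of the time; a correct guess earns reward 1.0."""
    def step(self, action):
        truth = 1 if random.random() < 0.7 else 0
        return 1.0 if action == truth else 0.0

env = CoinEnv()
prefs = [0.0, 0.0]   # the policy: a running reward estimate per action

for _ in range(3000):
    # agent: explore 10% of the time, otherwise act greedily
    if random.random() < 0.1:
        action = random.choice([0, 1])
    else:
        action = max((0, 1), key=lambda a: prefs[a])
    reward = env.step(action)                         # environment gives feedback
    prefs[action] += 0.05 * (reward - prefs[action])  # learn from the reward
```

The loop is the whole paradigm in one screen: the agent acts, the environment rewards, the policy shifts toward whatever earned more. Everything from DQN to AlphaZero elaborates on this skeleton.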
RL has been around for decades, but deep learning made it powerful:
- Deep Q-Networks (DQN) playing Atari games (2015)
- AlphaGo defeating Lee Sedol (2016)
- AlphaZero mastering multiple games (2017)
Implications for Software Development
Games are nice, but what does this mean for us as developers?
Automated Optimization
Many problems can be framed as games against nature:
- Compiler optimization (find the best instruction sequence)
- Resource allocation (minimize cost while meeting constraints)
- Network routing (minimize latency)
If you can define a reward signal, RL might find surprising solutions.
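For instance, an autoscaling reward might trade spend off against latency. The function below is a hypothetical sketch; the names, weights, and SLA threshold are all invented:

```python
def reward(latency_ms, cost_dollars, sla_ms=100.0):
    """Hypothetical autoscaling reward: pay for compute, and pay a large
    penalty for missing the latency SLA. Names and weights are invented."""
    penalty = 10.0 if latency_ms > sla_ms else 0.0
    return -cost_dollars - penalty
```

Most of the engineering effort in applied RL goes into getting a function like this right: a reward that is cheap to compute, hard to game, and actually aligned with what you want.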
Testing and Fuzzing
RL-based fuzzing is already being explored. Instead of random mutations, intelligent agents learn which inputs are likely to trigger bugs.
Infrastructure Management
Data center cooling, job scheduling, autoscaling—these are all sequential decision problems where RL shines.
Limitations and Caveats
AlphaZero is impressive, but let’s maintain perspective:
Compute Requirements: Training used roughly 5,000 TPUs to generate self-play games. This is not something you’ll run on your laptop.
Perfect Information: Games like chess have complete information. Real-world problems often involve uncertainty, partial observability, and noisy feedback.
Clear Rewards: Games have unambiguous win/lose signals. Real-world rewards are often delayed, sparse, or hard to define.
Simulation Speed: AlphaZero could play millions of games quickly. Many real-world domains don’t have fast simulators.
Getting Started with RL
If you want to explore reinforcement learning:
OpenAI Gym (now maintained as Gymnasium): The standard interface for RL environments

```python
import gym

env = gym.make('CartPole-v1')
observation = env.reset()
```
Stable Baselines3: High-quality RL algorithm implementations

```python
from stable_baselines3 import PPO

model = PPO('MlpPolicy', env)   # env created with gym.make as above
model.learn(total_timesteps=10_000)
```
Books:
- “Reinforcement Learning: An Introduction” by Sutton and Barto (the bible)
- “Deep Reinforcement Learning Hands-On” by Lapan (practical)
Courses:
- David Silver’s RL course (available on YouTube)
- Berkeley’s Deep RL course
What Comes Next
AlphaZero points toward a future where AI systems:
- Learn from first principles rather than human knowledge
- Discover strategies that surprise us
- Generalize across domains with minimal modification
We’re not there for most real-world problems yet. But the trajectory is clear.
Watch this space. Reinforcement learning is just getting started.
The best move is the one the machine discovers, not the one we teach it.