Mastering the Game of Go without Human Knowledge: Page-by-page notes
Contents
- Links
- Abstract and Introduction
- 1. Reinforcement Learning in AlphaGo Zero
- 2. Empirical Analysis of AlphaGo Zero Training
- 3. Knowledge Learned by AlphaGo Zero
- 4. Final Performance of AlphaGo Zero
- 5. Conclusion
Links
Abstract and Introduction
Page 1
- Solely based on RL, without human data, guidance, or domain knowledge beyond the rules of the game
- Precise and sophisticated lookahead needed for a game like Go
Page 2
- AlphaGo Fan, which defeated the European champion Fan Hui, had a policy neural network trained to predict expert moves and a value neural network trained to predict the winners of games that the policy network played against itself
- The trained nets were combined with MCTS to provide lookahead search, using the policy net to narrow the search to high-probability moves and the value net (along with MC rollouts) to evaluate tree positions
- This was followed by a similar model, AlphaGo Lee, which defeated Lee Sedol
- AlphaGo Zero differs as follows:
- Trained solely by self-play RL, starting with random moves
- Only black and white stones as input features
- A single network rather than separate policy and value networks
- Simpler tree search without MC rollouts
- Achieves this via an RL algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise, stable learning
1. Reinforcement Learning in AlphaGo Zero
Page 3 (last paragraph continues into Page 5)
- Neural net f_\theta takes as input a raw board representation of the position s and its history, and outputs (\mathbf{p}, v), where p_a = \Pr(a|s) and v estimates the probability of the current player winning from s.
- In each position MCTS is executed with the help of the net and outputs new probabilities \boldsymbol{\pi} for playing each move.
- \boldsymbol{\pi} tends to result in much stronger moves than the raw probabilities \mathbf{p}.
- Can see MCTS as a policy improvement operator.
- Self-play with search (using \boldsymbol{\pi} to select each move and the eventual winner z as a sample of the value) can be seen as a policy evaluation operator.
- Main idea of RL algorithm is to use these search operators repeatedly in policy iteration whereby neural net parameters are updated to make (\mathbf{p}, v) closely match (\boldsymbol{\pi}, z).
- The improved parameters are then used in the next iteration to make the search even stronger.
- MCTS details:
- Each search tree edge (s, a) stores prior probability P(s, a), visit count N(s, a), action value Q(s, a).
- Simulation begins from the root state, iteratively selects moves that maximise upper confidence bound Q(s, a) + U(s, a) where U(s, a) \propto P(s, a) / (1 + N(s, a)) until leaf node encountered.
- Leaf node expanded and evaluated only once by the net to generate both the prior probabilities and evaluation, (P(s', \cdot), V(s')) = f_\theta(s').
- Updates: each edge (s, a) traversed in the simulation has its visit count incremented, N(s, a) \leftarrow N(s, a) + 1, and its action value set to the mean of the evaluations in its subtree, Q(s, a) = \frac{1}{N(s, a)} \sum_{s' | s, a \rightarrow s'} V(s')
  - s, a \rightarrow s' means the simulation eventually reached s' after it took move a from position s.
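To make the edge statistics and selection rule concrete, here is a minimal Python sketch. The full PUCT form of U(s, a) (with an exploration constant c_puct and a \sqrt{\sum_b N(s, b)} numerator) follows the paper's Methods, but the constant's value, the Edge class layout, and the sign handling in backup are assumptions of this sketch, not the paper's code.

```python
import math

class Edge:
    """Statistics stored for each search-tree edge (s, a)."""
    def __init__(self, prior: float):
        self.P = prior   # prior probability P(s, a) from the network
        self.N = 0       # visit count N(s, a)
        self.W = 0.0     # total value of all evaluations in the subtree

    @property
    def Q(self) -> float:
        """Mean action value Q(s, a); defined as 0 for unvisited edges."""
        return self.W / self.N if self.N > 0 else 0.0

C_PUCT = 1.5  # exploration constant; assumed value, not given in the main text

def select(edges: dict[int, Edge]) -> int:
    """Select the action maximising Q(s, a) + U(s, a), where
    U(s, a) = c_puct * P(s, a) * sqrt(sum_b N(s, b)) / (1 + N(s, a))."""
    total_visits = sum(e.N for e in edges.values())
    def score(e: Edge) -> float:
        return e.Q + C_PUCT * e.P * math.sqrt(total_visits) / (1 + e.N)
    return max(edges, key=lambda a: score(edges[a]))

def backup(path: list[Edge], v_leaf: float) -> None:
    """Propagate the leaf evaluation V(s') back up the visited path so that
    Q(s, a) tracks the mean of all evaluations in the subtree below (s, a).
    `path` is ordered root-to-leaf; the sign flip assumes v is always from
    the perspective of the player to move (a simplification of this sketch)."""
    v = v_leaf
    for edge in reversed(path):
        v = -v          # value from the perspective of the player who chose this edge
        edge.N += 1
        edge.W += v
```

Storing the running total W rather than Q itself keeps the backup to a pair of increments; Q is recovered on demand.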
Page 4
Figure 1: Self-play reinforcement learning in AlphaGo Zero
Part (a)
- Program plays a game s_1 \ldots s_T against itself
- In each position s_t, an MCTS search is executed using the latest version of the neural network f_\theta
- Moves selected according to the search probabilities computed by MCTS, a_t \sim \boldsymbol{\pi}_t.
- Terminal position s_T scored according to game rules to compute winner z.
Part (b)
- NN takes as input the raw board position s_t and outputs a vector \mathbf{p}_t, representing a probability distribution over moves, and a scalar v_t, representing the probability of the current player winning in position s_t
- NN parameters \theta updated to maximise similarity of the policy vector \mathbf{p}_t to search probabilities \boldsymbol{\pi}_t
- Also to minimise error between predicted winner v_t and game winner z.
- New parameters are used in the next iteration of self-play, as in (a).
Page 5
- Can view MCTS as a self-play algorithm that, given \theta and a root position s, computes a vector of search probabilities recommending moves, \boldsymbol{\pi} = \alpha_\theta(s), where \pi_a \propto N(s, a)^{1/\tau} (the visit count of each move raised to 1/\tau for some temperature parameter \tau).
- NN trained by self-play RL algorithm that uses MCTS to play each move:
- Init to random \theta_0
- At each iteration i games of self-play generated
- At each time step t MCTS search \boldsymbol{\pi}_t is executed using previous iteration of NN f_{\theta_{i-1}} and move is sampled from \boldsymbol{\pi}_t.
- Game terminates at step T when:
- Both players pass; or
- Search value <= resignation threshold; or
- Exceeds max length
- Game is then scored to give a final reward r_T = \pm 1
- Data stored for each time step: (s_t, \boldsymbol{\pi}_t, z_t), where z_t = \pm r_T depending on whether the player to move at step t is the eventual winner.
- New parameters \theta_i trained from data (s, \boldsymbol{\pi}, z) sampled uniformly among all time-steps of the last iteration of self-play.
- NN adjusted to minimise the error between the predicted value v and the self-play winner z (using MSE) and to maximise the similarity of the NN move probabilities \mathbf{p} to the search probabilities \boldsymbol{\pi} (using a CE loss).
- Total loss is the sum of these two losses plus an L2 regularisation term: l = (z - v)^2 - \boldsymbol{\pi}^\top \log \mathbf{p} + c\|\theta\|^2
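A sketch of that combined loss as a training objective, using PyTorch as an assumed framework. The tiny stand-in network, the value of the L2 coefficient c, the logit-based policy output, and the 362-move action space (19 \times 19 points plus pass) are details of this sketch rather than anything prescribed by the notes above.

```python
import torch

def alphago_zero_loss(p_logits: torch.Tensor,  # network move logits, shape (B, 362)
                      v: torch.Tensor,         # network value prediction, shape (B,)
                      pi: torch.Tensor,        # MCTS search probabilities, shape (B, 362)
                      z: torch.Tensor,         # self-play winner +/-1, shape (B,)
                      params,                  # iterable of network parameters theta
                      c: float = 1e-4) -> torch.Tensor:
    """l = (z - v)^2 - pi^T log p + c * ||theta||^2.
    The value of c and the use of raw logits are assumptions of this sketch."""
    value_loss = torch.mean((z - v) ** 2)                      # MSE to the game outcome
    log_p = torch.log_softmax(p_logits, dim=1)
    policy_loss = -torch.mean(torch.sum(pi * log_p, dim=1))    # cross-entropy to pi
    l2 = sum((w ** 2).sum() for w in params)                   # L2 regularisation on theta
    return value_loss + policy_loss + c * l2

# Example with random stand-in tensors and a tiny linear "network":
net = torch.nn.Linear(17 * 19 * 19, 363)           # 362 move logits + 1 value output
out = net(torch.randn(8, 17 * 19 * 19))
p_logits, v = out[:, :362], torch.tanh(out[:, 362])
pi = torch.softmax(torch.randn(8, 362), dim=1)     # stand-in search probabilities
z = torch.randint(0, 2, (8,)).float() * 2 - 1      # stand-in game outcomes +/-1
loss = alphago_zero_loss(p_logits, v, pi, z, net.parameters())
loss.backward()                                    # gradients for one SGD step on theta
```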
Figure 2: Monte-Carlo tree search in AlphaGo Zero
- Steps (a)-(c) are repeated for each simulation; (d) is performed once the search completes
Part (a) Select
- Each simulation traverses the tree by selecting the edge with maximum Q + U (action value plus upper confidence bound)
- U depends on the stored prior probability P and the visit count N of the edge, which is incremented each time the edge is traversed
Part (b) Expand and evaluate
- Leaf node is expanded and the NN is used to evaluate (P(s, \cdot), V(s)) = f_\theta(s) for the associated position s
- Each outgoing edge from s stores its component of the prior vector P(s, \cdot)
Part (c) Backup
- Action values Q updated to track mean of all evaluations V in subtree below that action
Part (d) Play
- Search probabilities \boldsymbol{\pi} are returned, proportional to N^{1/\tau}:
- N: visit count of each move from root state
- \tau: parameter controlling temperature
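A small sketch of the play step (d): converting root visit counts into search probabilities \pi_a \propto N(s, a)^{1/\tau} and sampling a move. The visit counts and the low-temperature value used here are made up for illustration.

```python
import numpy as np

def search_probabilities(visit_counts: np.ndarray, tau: float) -> np.ndarray:
    """pi_a proportional to N(s, a)^(1/tau): tau = 1 plays in proportion to
    visit counts (exploratory); tau -> 0 approaches greedy argmax play."""
    scaled = visit_counts.astype(np.float64) ** (1.0 / tau)
    return scaled / scaled.sum()

# Example: root visit counts over 5 legal moves after 1,600 simulations (made up).
N = np.array([900, 400, 200, 80, 20])
pi = search_probabilities(N, tau=1.0)            # exploratory play
a = np.random.choice(len(pi), p=pi)              # a_t ~ pi_t
pi_late = search_probabilities(N, tau=0.05)      # near-deterministic play
print(pi.round(3), pi_late.round(3), a)
```

The paper's Methods use \tau = 1 for the early moves of each self-play game and an infinitesimal temperature afterwards; the 0.05 above is only a stand-in for \tau \to 0.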
2. Empirical Analysis of AlphaGo Zero Training
Page 6
- 3 days of training involving 4.9 million self-play games, using 1,600 simulations for each MCTS (~0.4 s of thinking time per move)
- 700k mini-batches of 2048 positions used to update \theta
Page 7
Figure 3: Empirical evaluation of AlphaGo Zero
Part (a)
- RL versus supervised learning from human data. Smooth progression of learning - no oscillations or catastrophic forgetting
- AlphaGo Zero outperformed AlphaGo Lee after just 36 hours - latter trained over several months
- Using a single machine with 4 TPUs, the fully trained model defeated AlphaGo Lee by 100 games to 0.
Part (b), (c)
- Supervised learning is better at predicting human professional moves and performs better initially, but after about 24 hours RL takes over
Page 9
Figure 4: Comparison of neural network architectures in AlphaGo Zero and AlphaGo Lee
- Four different models that used either separate or combined policy and value networks (“dual”) and residual or convolutional architectures
- Each net was trained to minimise the same loss function using a fixed dataset of self-play games generated by AlphaGo Zero after 72 hours of self-play training.
Part (a)
- Each model was combined with AlphaGo Zero’s search to obtain a different player
- Elo ratings were computed from eval games between the different players - 5s thinking time per move
Part (b), (c)
- Residual nets more accurate at predicting human moves and had lower error at predicting professional game outcomes.
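For concreteness, a sketch of the "dual-res" variant in PyTorch: one residual tower shared by a policy head and a value head. The filter counts, head shapes, and 17-plane input follow my reading of the paper's Methods and are assumptions of this sketch rather than a faithful reimplementation; the block count in the example is deliberately tiny so it runs quickly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with batch norm and a skip connection."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)              # skip connection

class DualResNet(nn.Module):
    """Shared residual tower with separate policy and value heads ('dual-res')."""
    def __init__(self, blocks: int = 19, channels: int = 256,
                 board: int = 19, planes: int = 17):
        super().__init__()
        moves = board * board + 1           # 361 board points + pass
        self.stem = nn.Sequential(
            nn.Conv2d(planes, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU())
        # Tower depth is a parameter; the paper's versions are far deeper than this example.
        self.tower = nn.Sequential(*[ResidualBlock(channels) for _ in range(blocks)])
        # Policy head: 1x1 conv down to 2 planes, then a linear layer over the board.
        self.policy_conv = nn.Sequential(
            nn.Conv2d(channels, 2, 1, bias=False), nn.BatchNorm2d(2), nn.ReLU())
        self.policy_fc = nn.Linear(2 * board * board, moves)
        # Value head: 1x1 conv down to 1 plane, two linear layers, tanh to [-1, 1].
        self.value_conv = nn.Sequential(
            nn.Conv2d(channels, 1, 1, bias=False), nn.BatchNorm2d(1), nn.ReLU())
        self.value_fc = nn.Sequential(
            nn.Linear(board * board, 256), nn.ReLU(), nn.Linear(256, 1), nn.Tanh())

    def forward(self, x):
        h = self.tower(self.stem(x))
        p = self.policy_fc(self.policy_conv(h).flatten(1))             # move logits
        v = self.value_fc(self.value_conv(h).flatten(1)).squeeze(-1)   # scalar value
        return p, v

# Example: a batch of 4 positions encoded as 17 feature planes of 19x19.
net = DualResNet(blocks=2)                  # tiny tower just to keep the example fast
p, v = net(torch.zeros(4, 17, 19, 19))
print(p.shape, v.shape)                     # torch.Size([4, 362]) torch.Size([4])
```

The "sep" variants in the comparison would use two such towers, one per head, and the "conv" variants replace the residual blocks with plain convolutional blocks.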
3. Knowledge Learned by AlphaGo Zero
Page 10
- AlphaGo Zero managed to discover existing corner sequences (joseki) but ultimately preferred ones that were previously unknown
- Starting from entirely random moves, AlphaGo Zero rapidly progressed towards a sophisticated understanding of Go concepts
- But one of the first elements learned by humans, shicho (ladder capture sequences), was only understood by AlphaGo Zero much later in training
4. Final Performance of AlphaGo Zero
Page 10
- Larger NN and longer duration
- Trained from scratch for ~40 days
- 29 million games of self-play were generated and the weights were updated from 3.1 million mini-batches of 2,048 positions each.
- NN had 40 res blocks
Page 12
- Overtook AlphaGo Lee by a wide margin after a few days, but took over a month to overtake AlphaGo Master, and only by a small margin (Figure 6a)
- Evaluated using internal tournament against the following models, all of which it outperformed:
- AlphaGo Fan (defeated Fan Hui)
- AlphaGo Lee (defeated Lee Sedol)
- AlphaGo Master (based on same algorithm and architecture but using human data and features, which defeated top human professionals 60-0 in online games)
- Raw version of AlphaGo Zero which selects move a with max probability p_a without MCTS
- Previous algorithms for Go
- AlphaGo Zero won 89 games to 11 when evaluated head to head against AlphaGo Master in a 100 game match with 2 hour time controls
5. Conclusion
Paper results have demonstrated:
- Pure RL approach is fully feasible even in the most challenging of domains
- Possible to train to superhuman level without human supervision beyond basic rules about domain
- Trains faster and achieves much better asymptotic performance than training on expert data