Contents

Abstract and Introduction

Page 1

  • Trained solely by RL, with no human data, guidance, or domain knowledge beyond the game rules
  • Precise and sophisticated lookahead is needed for a game like Go

Page 2

  • AlphaGo Fan, which defeated European champion Fan Hui, had a policy neural network trained to predict expert moves and a value neural network trained to predict the winners of games that the policy network played against itself
  • The trained nets were combined with MCTS to provide lookahead search, using the policy net to narrow the search to high-probability moves and the value net (along with Monte Carlo rollouts) to evaluate tree positions
  • This was followed by a similar model, AlphaGo Lee, which defeated Lee Sedol
  • AlphaGo Zero differs as follows:
    1. Trained solely by self-play RL, starting with random moves
    2. Only black and white stones as input features
    3. A single net rather than separate policy and value nets
    4. Simpler tree search without MC rollouts
  • Achieved via an RL algorithm that incorporates lookahead search inside the training loop; the result is rapid improvement and precise, stable learning

1. Reinforcement Learning in AlphaGo Zero

Page 3 (last paragraph continues into Page 5)

  • Neural net f_\theta takes as input a raw board representation of the position s and its history, and outputs (\mathbf{p}, v), where p_a = \Pr(a \vert s) and v estimates the probability of the current player winning from s.
  • In each position MCTS is executed with the help of the net and outputs new probabilities \boldsymbol{\pi} for playing each move.
  • \boldsymbol{\pi} tends to result in much stronger moves than the raw probabilities \mathbf{p}.
  • Can see MCTS as a policy improvement operator.
  • Self-play with search, i.e. using \boldsymbol{\pi} to select each move and using the game winner z as a sample of the value, can be seen as a policy evaluation operator.
  • Main idea of RL algorithm is to use these search operators repeatedly in policy iteration whereby neural net parameters are updated to make (\mathbf{p}, v) closely match (\boldsymbol{\pi}, z).
  • The improved parameters are then used in the next iteration to make the search even stronger.
  • MCTS details:
    • Each search tree edge (s, a) stores prior probability P(s, a), visit count N(s, a), action value Q(s, a).
    • Each simulation begins from the root state and iteratively selects moves that maximise the upper confidence bound Q(s, a) + U(s, a), where U(s, a) \propto P(s, a) / (1 + N(s, a)), until a leaf node is encountered.
    • The leaf node is expanded and evaluated only once by the net to generate both the prior probabilities and the evaluation, (P(s', \cdot), V(s')) = f_\theta(s').
    • Updates:
    N(s, a) \leftarrow N(s, a) + 1 \\ Q(s, a) \leftarrow \frac{1}{N(s, a)} \sum_{s' \vert s, a \rightarrow s'} V(s')
    • s, a \rightarrow s' means the simulation eventually reached s' after taking move a from position s.
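
A minimal Python sketch of these per-edge statistics and update rules (the names Edge, select and backup and the constant c_puct are illustrative, not the paper's implementation; perspective handling between the two players is omitted):

```python
from dataclasses import dataclass

@dataclass
class Edge:
    P: float        # prior probability P(s, a) from the network
    N: int = 0      # visit count N(s, a)
    W: float = 0.0  # accumulated leaf evaluations V(s') below this edge
    Q: float = 0.0  # mean action value, W / N

def select(edges: dict, c_puct: float = 1.0):
    """Return the action maximising Q(s, a) + U(s, a), with U ∝ P / (1 + N)."""
    return max(edges, key=lambda a: edges[a].Q + c_puct * edges[a].P / (1 + edges[a].N))

def backup(path: list, v: float) -> None:
    """Apply N ← N + 1 and Q ← W / N along the traversed edges."""
    for edge in path:
        edge.N += 1
        edge.W += v
        edge.Q = edge.W / edge.N
```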

Page 4

Figure 1: Self-play reinforcement learning in AlphaGo Zero

Figure 1 from the paper

Part (a)

  • Program plays a game s_1 \ldots s_T against itself
  • In each position s_t the program executes MCTS using the latest version of the model f_\theta
  • Moves selected according to search probabilities computed by MCTS, a_t \sim \pi_t.
  • Terminal position s_T scored according to game rules to compute winner z.

Part (b)

  • NN takes as input the raw board position s_t and outputs a vector \mathbf{p}_t, representing a probability distribution over moves, and a scalar v_t, representing the probability of the current player winning from position s_t
  • NN parameters \theta updated to maximise similarity of the policy vector \mathbf{p}_t to search probabilities \boldsymbol{\pi}_t
  • Also to minimise error between predicted winner v_t and game winner z.
  • New parameters are used in the next iteration of self-play, as in part (a); a rough sketch of such a combined network follows below.
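
A rough sketch of this combined policy-and-value ("dual") network, written with PyTorch purely for illustration (the library choice and layer sizes are assumptions; the paper's network is a much deeper residual tower, and the 17 input planes follow its stone-and-history encoding):

```python
import torch
import torch.nn as nn

class DualHeadNet(nn.Module):
    """Illustrative combined policy/value net; not the paper's exact architecture."""
    def __init__(self, board_size: int = 19, in_planes: int = 17, channels: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(            # stands in for the residual tower
            nn.Conv2d(in_planes, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )
        self.policy_head = nn.Sequential(      # one logit per board point, plus pass
            nn.Conv2d(channels, 2, 1), nn.Flatten(),
            nn.Linear(2 * board_size**2, board_size**2 + 1),
        )
        self.value_head = nn.Sequential(       # scalar prediction in [-1, 1]
            nn.Conv2d(channels, 1, 1), nn.Flatten(),
            nn.Linear(board_size**2, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh(),
        )

    def forward(self, s: torch.Tensor):
        h = self.trunk(s)
        p_logits = self.policy_head(h)       # softmax of these gives \mathbf{p}_t
        v = self.value_head(h).squeeze(-1)   # predicted winner v_t
        return p_logits, v
```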

Page 5

  • Can view MCTS as a self-play algorithm that, given \theta and a root position s, computes a vector of search probabilities recommending moves, \boldsymbol{\pi} = \alpha_\theta(s), where \pi_a \propto N(s, a)^{1/\tau}, i.e. the exponentiated visit count of each move for some temperature parameter \tau.

  • NN trained by self-play RL algorithm that uses MCTS to play each move:
  • Init to random \theta_0
  • At each iteration i games of self-play generated
  • At each time step t an MCTS search is executed using the previous iteration of the NN, f_{\theta_{i-1}}, and a move is sampled from the resulting \boldsymbol{\pi}_t.
  • Game terminates at step T when:
    • Both players pass; or
    • Search value <= resignation threshold; or
    • Exceeds max length
  • The game is then scored to give a final reward r_T = \pm 1
  • Data stored for each time step: (s_t, \boldsymbol{\pi}_t, z_t), where z_t = \pm r_T from the perspective of the player to move at step t
  • New parameters \theta_i trained from data (s, \boldsymbol{\pi}, z) sampled uniformly among all time-steps of the last iteration of self-play.
  • NN parameters adjusted to minimise the error between the predicted value v and the self-play winner z (MSE) and between the NN move probabilities \mathbf{p} and the search probabilities \boldsymbol{\pi} (cross-entropy loss)
  • Total loss is the sum of these two losses plus an L2 regularisation term, written out below
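
Combining the two losses and the regulariser just described, the total loss can be written as (c is the regularisation coefficient):

l = (z - v)^2 - \boldsymbol{\pi}^\top \log \mathbf{p} + c \lVert \theta \rVert^2

The squared error matches v to z, the cross-entropy matches \mathbf{p} to \boldsymbol{\pi}, and the final term regularises \theta.
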
Figure 2: Monte-Carlo tree search in AlphaGo Zero

Figure 2 from the paper

  • Steps (a), (b) and (c) are repeated for each simulation

Part (a) Select

  • Each simulation traverses the tree by selecting the edge with maximum Q + U (action value plus upper confidence bound)
  • U depends on the stored prior probability P and the visit count N of the edge, which is incremented each time the edge is traversed

Part (b) Expand and evaluate

  • The leaf node is expanded and the NN is used to evaluate (P(s, \cdot), V(s)) for the associated position s
  • The outgoing edges from s store the vector of prior probabilities P

Part (c) Backup

  • Action values Q updated to track mean of all evaluations V in subtree below that action

Part (d) Play

  • Search probabilities \boldsymbol{\pi} are returned proportional to N^{1/\tau}:
    • N: visit count of each move from root state
    • \tau: parameter controlling temperature
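
A small Python sketch of this play step (the function name and the cut-off used for the greedy τ → 0 limit are illustrative assumptions):

```python
import numpy as np

def search_probabilities(visit_counts, tau: float = 1.0) -> np.ndarray:
    """Return π with π_a ∝ N(s, a)^(1/τ); small τ approaches greedy selection."""
    counts = np.asarray(visit_counts, dtype=np.float64)
    if tau < 1e-2:                      # treat very small τ as the greedy limit
        pi = np.zeros_like(counts)
        pi[counts.argmax()] = 1.0
        return pi
    scaled = counts ** (1.0 / tau)
    return scaled / scaled.sum()

# e.g. visit counts [10, 40, 50] give π = [0.1, 0.4, 0.5] at τ = 1,
# and π = [0, 0, 1] in the greedy τ → 0 limit.
```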

2. Empirical Analysis of AlphaGo Zero Training

Page 6

  • 3 days of training involving 4.9 million self-play games, using 1,600 simulations for each MCTS (~0.4 s thinking time per move)
  • 700k mini-batches of 2048 positions used to update \theta
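  • For scale, that is roughly 700,000 × 2,048 ≈ 1.4 billion training positions sampled from those self-play games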

Page 7

Figure 3: Empirical evaluation of AlphaGo Zero

Figure 3 from the paper

Part (a)

  • RL versus supervised learning from human data. Smooth progression of learning - no oscillations or catastrophic forgetting
  • AlphaGo Zero outperformed AlphaGo Lee after just 36 hours; the latter was trained over several months
  • Using a single machine with 4 TPUs, the fully trained model defeated AlphaGo Lee by 100 games to 0.

Part (b), (c)

  • Supervised learning is better at predicting human professional moves and gives better playing performance initially, but after about 24 hours RL takes over

Page 9

Figure 4: Comparison of neural network architectures in AlphaGo Zero and AlphaGo Lee

Figure 4 from the paper

  • Four different models that used either separate or combined policy and value networks (“dual”) and residual or convolutional architectures
  • Each net was trained to minimise the same loss function using a fixed dataset of self-play games generated by AlphaGo Zero after 72 hours of self-play training.

Part (a)

  • Each model was combined with AlphaGo Zero’s search to obtain a different player
  • Elo ratings were computed from evaluation games between the different players, with 5 s thinking time per move
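  • (For interpreting these ratings: under the conventional Elo scale, a rating gap of \Delta corresponds to an expected score of 1 / (1 + 10^{-\Delta/400}) for the stronger player, roughly 76% for a 200-point gap)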

Part (b), (c)

  • Residual nets more accurate at predicting human moves and had lower error at predicting professional game outcomes.

3. Knowledge Learned by AlphaGo Zero

Page 10

  • AlphaGo Zero managed to discover existing corner sequences (joseki) but ultimately preferred ones that were previously unknown
  • Starting from entirely random moves, AlphaGo Zero rapidly progressed towards a sophisticated understanding of Go concepts
  • But one of the first elements learned by humans, the ladder (shicho), was only understood by AlphaGo Zero much later in training

4. Final Performance of AlphaGo Zero

Page 10

  • A larger NN and a longer training duration were used
  • Trained from scratch for ~40 days
  • 29 million games of self-play were generated and the weights were updated from 3.1 million mini-batches of 2,048 positions each.
  • The NN had 40 residual blocks

Page 12

  • Overtook AlphaGo Lee by a large margin after a few days, but took over a month to overtake AlphaGo Master by only a small margin (Figure 6a)
  • Evaluated using internal tournament against the following models, all of which it outperformed:
    • AlphaGo Fan (defeated Fan Hui)
    • AlphaGo Lee (defeated Lee Sedol)
    • AlphaGo Master (based on same algorithm and architecture but using human data and features, which defeated top human professionals 60-0 in online games)
    • Raw version of AlphaGo Zero which selects move a with max probability p_a without MCTS
    • Previous Go programs
  • AlphaGo Zero won 89 games to 11 when evaluated head to head against AlphaGo Master in a 100 game match with 2 hour time controls

5. Conclusion

Paper results have demonstrated:

  • Pure RL approach is fully feasible even in the most challenging of domains
  • Possible to train to superhuman level without human supervision beyond basic rules about domain
  • Trains faster and achieves much better asymptotic performance than training on human expert data