Mastering the Game of Go without Human Knowledge: Page-by-page notes
Contents
- Links
 - Abstract and Introduction
 - 1. Reinforcement Learning in AlphaGo Zero
 - 2. Empirical Analysis of AlphaGo Zero Training
 - 3. Knowledge Learned by AlphaGo Zero
 - 4. Final Performance of AlphaGo Zero
 - 5. Conclusion
 
Links
Abstract and Introduction
Page 1
- Trained solely by RL, without human data, guidance or domain knowledge beyond the rules of the game
 - Precise and sophisticated lookahead needed for a game like Go
 
Page 2
- AlphaGo Fan, which defeated the European champion Fan Hui, had a policy neural network trained to predict expert moves and a value neural network trained to predict the winners of games that the policy network played against itself
 - The trained nets were combined with MCTS to provide lookahead search, using the policy net to narrow the search to high-probability moves and the value net (along with Monte-Carlo rollouts) to evaluate tree positions
 - This was followed by a similar model, AlphaGo Lee, which defeated Lee Sedol
 - AlphaGo Zero differs as follows:
    - Trained solely by self-play RL, starting from random play
    - Uses only the black and white stones on the board as input features
    - Uses a single network rather than separate policy and value networks
    - Uses a simpler tree search that relies on this single network, without Monte-Carlo rollouts
 - Achieves this via an RL algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise, stable learning
 
1. Reinforcement Learning in AlphaGo Zero
Page 3 (last paragraph continues into Page 5)
- Neural net $f_\theta$ takes as input a raw board representation of the position $s$ and its history, and outputs $(\mathbf{p}, v)$, where $p_a = \mathrm{Pr}(a\vert s)$ and $v$ estimates the probability of the current player winning from $s$.
 - In each position MCTS is executed with the help of the net and outputs new probabilities $\boldsymbol{\pi}$ for playing each move.
 - $\boldsymbol{\pi}$ tends to result in much stronger moves than the raw probabilities $\mathbf{p}$.
 - Can see MCTS as a policy improvement operator.
 - Self-play with search (using $\boldsymbol{\pi}$ to select each move and the game winner $z$ as a sample of the value) can be seen as a policy evaluation operator.
 - Main idea of RL algorithm is to use these search operators repeatedly in policy iteration whereby neural net parameters are updated to make $(\mathbf{p}, v)$ closely match $(\boldsymbol{\pi}, z)$.
 - The improved parameters are then used in the next iteration to make the search even stronger.
 - MCTS details (a minimal sketch of one simulation follows after this list):
    - Each search tree edge $(s, a)$ stores a prior probability $P(s, a)$, a visit count $N(s, a)$ and an action value $Q(s, a)$.
    - Each simulation begins from the root state and iteratively selects the move that maximises the upper confidence bound $Q(s, a) + U(s, a)$, where $U(s, a) \propto P(s, a) / (1 + N(s, a))$, until a leaf node is encountered.
    - The leaf node is expanded and evaluated only once by the net to generate both prior probabilities and an evaluation, $(P(s', \cdot), V(s')) = f_\theta(s')$.
    - Updates: each edge $(s, a)$ traversed in the simulation is updated so that $N(s, a) \leftarrow N(s, a) + 1$ and $Q(s, a) = \frac{1}{N(s, a)} \sum_{s' \mid s, a \rightarrow s'} V(s')$.
    - Here $s, a \rightarrow s'$ means the simulation eventually reached $s'$ after taking move $a$ from position $s$.
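A minimal Python sketch of one such simulation, under stated assumptions: `net` and `game` are hypothetical stand-ins for $f_\theta$ and a rules engine, `C_PUCT` is an assumed exploration constant (the notes only give $U \propto P/(1+N)$), and the sign flip during backup reflects the usual two-player convention that $V$ is from the perspective of the player to move, a detail not spelled out in the notes above.

```python
import math

C_PUCT = 1.5  # assumed exploration constant; the paper only states U ∝ P / (1 + N)


class Node:
    """A tree node holding the statistics of the edge (s, a) that leads into it."""
    def __init__(self, prior):
        self.P = prior        # prior probability P(s, a) from the network
        self.N = 0            # visit count N(s, a)
        self.W = 0.0          # sum of leaf evaluations backed up through this edge
        self.children = {}    # action -> child Node

    @property
    def Q(self):              # mean action value Q(s, a)
        return self.W / self.N if self.N else 0.0


def simulate(root, state, net, game):
    """One simulation: select with Q + U, expand/evaluate the leaf once, back up V."""
    node, path = root, []
    # (a) Select: descend while the node has children, maximising Q + U.
    while node.children:
        total_n = sum(child.N for child in node.children.values())
        action, node = max(
            node.children.items(),
            key=lambda kv: kv[1].Q
            + C_PUCT * kv[1].P * math.sqrt(total_n) / (1 + kv[1].N),
        )
        path.append(node)
        state = game.next_state(state, action)   # hypothetical rules interface
    # (b) Expand and evaluate: a single network call gives priors and a value.
    priors, value = net(state)                   # (P(s', .), V(s')) = f_theta(s')
    for action, p in priors.items():
        node.children[action] = Node(p)
    # (c) Backup: V is from the perspective of the player to move at the leaf,
    # so flip its sign at each ply while walking back up the traversed edges.
    for i, edge in enumerate(reversed(path)):
        edge.N += 1
        edge.W += -value if i % 2 == 0 else value
```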
 
 
Page 4
Figure 1: Self-play reinforcement learning in AlphaGo Zero

Part (a)
- Program plays a game $s_1 \ldots s_T$ against itself
 - In each position $s_t$, an MCTS search is executed using the latest neural network $f_\theta$
 - Moves selected according to search probabilities computed by MCTS, $a_t \sim \pi_t$.
 - Terminal position $s_T$ scored according to game rules to compute winner $z$.
 
Part (b)
- NN takes as input $s_t$, the raw board position, and outputs a vector $\mathbf{p}_t$ representing a probability distribution over moves and a scalar $v_t$ representing the probability of the current player winning in $s_t$
 - NN parameters $\theta$ updated to maximise similarity of the policy vector $\mathbf{p}_t$ to search probabilities $\boldsymbol{\pi}_t$
 - Also to minimise error between predicted winner $v_t$ and game winner $z$.
 - The new parameters are used in the next iteration of self-play, as in (a).
 
Page 5
- Can view MCTS as a self-play algorithm that, given parameters $\theta$ and a root position $s$, computes a vector of search probabilities recommending moves, $\boldsymbol{\pi} = \alpha_\theta(s)$, where $\pi_a \propto N(s, a)^{1/\tau}$, i.e. the visit count of each move raised to the power $1/\tau$ for some temperature $\tau$.
 - NN trained by self-play RL algorithm that uses MCTS to play each move:
 - Init to random $\theta_0$
 - At each iteration $i$ games of self-play generated
 - At each time step $t$, an MCTS search $\boldsymbol{\pi}_t = \alpha_{\theta_{i-1}}(s_t)$ is executed using the previous iteration's network $f_{\theta_{i-1}}$, and a move is sampled from $\boldsymbol{\pi}_t$.
 - The game terminates at step $T$ when:
    - Both players pass; or
    - The search value drops below a resignation threshold; or
    - The game exceeds a maximum length
 - The game is then scored to give a final reward $r_T = \pm 1$
 - Data for each time step: $(s_t, \boldsymbol{\pi}_t, z_t)$ where sign of $z_t$ depends on whether current player is winner.
 - New parameters $\theta_i$ trained from data $(s, \boldsymbol{\pi}, z)$ sampled uniformly among all time-steps of the last iteration of self-play.
 - NN adjusted to min error between pred value $v$ and self-play winner $z$ (using MSE) and NN move probabilities $\mathbf{p}$ and search probabilities $\boldsymbol{\pi}$ (using CE loss).
 - Total loss is the sum of these two losses plus an L2 regularisation term: $l = (z - v)^2 - \boldsymbol{\pi}^{\top} \log \mathbf{p} + c \lVert \theta \rVert^2$ (a small numerical sketch follows below)
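A small numerical sketch of the training targets and this loss, assuming numpy; the helper names (`game_targets`, `alphazero_loss`) and the regularisation weight `c = 1e-4` are illustrative choices of the sketch, while the loss form itself is the one stated above.

```python
import numpy as np


def game_targets(num_moves: int, winner: int) -> np.ndarray:
    """Per-step value targets z_t for one finished self-play game.

    `winner` is +1 if the player who moved first won, -1 otherwise; the sign
    alternates so each z_t is from the perspective of the player to move at t.
    """
    return np.array([winner if t % 2 == 0 else -winner for t in range(num_moves)], dtype=float)


def alphazero_loss(p, v, pi, z, theta, c=1e-4):
    """Loss for a single position: (z - v)^2 - pi^T log(p) + c * ||theta||^2."""
    value_loss = (z - v) ** 2                       # mean-squared error on the outcome
    policy_loss = -np.sum(pi * np.log(p + 1e-12))   # cross-entropy against search probs
    l2 = c * np.sum(theta ** 2)                     # L2 regularisation on the parameters
    return value_loss + policy_loss + l2


# Toy example: a 3-move game won by the second player, and a position with 4 candidate moves.
z = game_targets(num_moves=3, winner=-1)            # array([-1.,  1., -1.])
p = np.array([0.1, 0.2, 0.3, 0.4])                  # network move probabilities
pi = np.array([0.0, 0.1, 0.2, 0.7])                 # MCTS search probabilities
print(alphazero_loss(p, v=0.05, pi=pi, z=z[0], theta=np.zeros(10)))
```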
 
Figure 2: Monte-Carlo tree search in AlphaGo Zero

- Steps (a), (b) and (c) are repeated for each simulation
 
Part (a) Select
- Each simulation traverses the tree by selecting, at each step, the edge with maximum $Q + U$ (action value plus upper confidence bound)
 - $U$ depends on the stored prior probability $P$ and visit count $N$ for that edge, which is incremented each time the edge is traversed
 
Part (b) Expand and evaluate
- The leaf node is expanded and the NN is used to evaluate $(P(s, \cdot), V(s))$ for the associated position $s$
 - The outgoing edges from $s$ store the corresponding entries of the prior vector $P(s, \cdot)$.
 
Part (c) Backup
- Action values $Q$ updated to track mean of all evaluations $V$ in subtree below that action
 
Part (d) Play
- Search probabilities $\boldsymbol{\pi}$ are returned, proportional to $N^{1/\tau}$ (see the sketch after this list):
    - $N$: visit count of each move from the root state
    - $\tau$: a parameter controlling temperature
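A tiny numpy sketch of how the root visit counts might be converted into $\boldsymbol{\pi}$; the greedy cut-off for very small $\tau$ is an implementation choice of the sketch, not something from the paper.

```python
import numpy as np


def search_probabilities(visit_counts: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Turn root visit counts N(s, a) into search probabilities pi_a ∝ N(s, a)^(1/tau).

    tau = 1 reproduces the visit-count distribution (more exploration);
    tau → 0 concentrates all probability on the most-visited move (greedy play).
    """
    if tau <= 1e-3:                                  # treat very small tau as the greedy limit
        pi = np.zeros(len(visit_counts), dtype=float)
        pi[np.argmax(visit_counts)] = 1.0
        return pi
    exponentiated = visit_counts.astype(float) ** (1.0 / tau)
    return exponentiated / exponentiated.sum()


# Example: four candidate moves with visit counts from a finished search.
counts = np.array([10, 40, 25, 25])
print(search_probabilities(counts, tau=1.0))   # ∝ raw counts: [0.1, 0.4, 0.25, 0.25]
print(search_probabilities(counts, tau=0.0))   # greedy: [0., 1., 0., 0.]
```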
 
 
2. Empirical Analysis of AlphaGo Zero Training
Page 6
- Three days of training, involving 4.9 million self-play games, with 1,600 MCTS simulations per move (about 0.4 s of thinking time per move)
 - 700k mini-batches of 2,048 positions each were used to update $\theta$
 
Page 7
Figure 3: Empirical evaluation of AlphaGo Zero

Part (a)
- RL versus supervised learning from human data. Smooth progression of learning - no oscillations or catastrophic forgetting
 - AlphaGo Zero outperformed AlphaGo Lee after just 36 hours; the latter was trained over several months
 - Using a single machine with 4 TPUs, the fully trained model defeated AlphaGo Lee by 100 games to 0.
 
Part (b), (c)
- Supervised learning is better at predicting human professional moves, and the supervised player performs better initially, but after about 24 hours the RL-trained player overtakes it
 
Page 9
Figure 4: Comparison of neural network architectures in AlphaGo Zero and AlphaGo Lee

- Four models were compared, using either separate or combined ('dual') policy and value networks, and either residual or convolutional architectures (a rough sketch of the dual residual idea follows after this list)
 - Each net trained to min same loss function using a fixed dataset of self-play games generated by AlphaGo Zero after 72 hours of self-play training.
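A rough PyTorch sketch of the combined ('dual') residual idea: one residual tower feeding separate policy and value heads. The sizes (channels, block count, 17 input planes) are illustrative assumptions of this sketch, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19
MOVES = BOARD * BOARD + 1   # every point plus pass


class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)            # skip connection ("res" variants)


class DualNet(nn.Module):
    """One shared tower, two heads: move logits and a value in [-1, 1]."""
    def __init__(self, in_planes: int = 17, channels: int = 64, blocks: int = 4):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU())
        self.tower = nn.Sequential(*[ResidualBlock(channels) for _ in range(blocks)])
        self.policy_head = nn.Sequential(
            nn.Conv2d(channels, 2, 1), nn.BatchNorm2d(2), nn.ReLU(),
            nn.Flatten(), nn.Linear(2 * BOARD * BOARD, MOVES))
        self.value_head = nn.Sequential(
            nn.Conv2d(channels, 1, 1), nn.BatchNorm2d(1), nn.ReLU(),
            nn.Flatten(), nn.Linear(BOARD * BOARD, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh())

    def forward(self, x):
        h = self.tower(self.stem(x))
        return self.policy_head(h), self.value_head(h)


# Example: a batch of 2 dummy positions with 17 input feature planes.
net = DualNet()
logits, value = net(torch.zeros(2, 17, BOARD, BOARD))   # shapes (2, 362) and (2, 1)
```

The 'sep' variants would train two such towers, one per head, while the 'conv' variants drop the skip connections.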
 
Part (a)
- Each model was combined with AlphaGo Zero’s search to obtain a different player
 - Elo ratings were computed from eval games between the different players - 5s thinking time per move
 
Part (b), (c)
- Residual nets more accurate at predicting human moves and had lower error at predicting professional game outcomes.
 
3. Knowledge Learned by AlphaGo Zero
Page 10
- AlphaGo Zero managed to discover existing corner sequences (joseki) but ultimately preferred ones that were previously unknown
 - Starting from entirely random play, AlphaGo Zero rapidly progressed towards a sophisticated understanding of Go concepts
 - But shicho (ladder capture sequences), one of the first elements learned by human players, was only understood by AlphaGo Zero much later in training
 
4. Final Performance of AlphaGo Zero
Page 10
- Larger NN and longer duration
 - Trained from scratch for ~40 days
 - 29 million games of self-play were generated and the weights were updated from 3.1 million mini-batches of 2,048 positions each.
 - NN had 40 res blocks
 
Page 12
- Surpassed AlphaGo Lee by a wide margin after a few days, but took over a month to overtake AlphaGo Master by only a small margin (Figure 6a)
 - Evaluated using an internal tournament against the following models, all of which it outperformed:
    - AlphaGo Fan (defeated Fan Hui)
    - AlphaGo Lee (defeated Lee Sedol)
    - AlphaGo Master (based on the same algorithm and architecture but using human data and features; defeated top human professionals 60-0 in online games)
    - A raw version of AlphaGo Zero that selects the move $a$ with maximum probability $p_a$, without MCTS
    - Previous Go programs
 
 - AlphaGo Zero won 89 games to 11 when evaluated head-to-head against AlphaGo Master in a 100-game match with 2-hour time controls
 
5. Conclusion
Paper results have demonstrated:
- A pure RL approach is fully feasible, even in the most challenging of domains
 - It is possible to train to superhuman level without human examples or guidance, given no knowledge of the domain beyond its basic rules
 - Compared to training on human expert data, pure RL requires only a few more hours of training and reaches much better asymptotic performance