Contents

Abstract and Introduction

Page 1

  • Based solely on RL, without human data, guidance, or domain knowledge beyond the rules of the game
  • Precise and sophisticated lookahead needed for a game like Go

Page 2

  • AlphaGo Fan, which defeated European champion Fan Hui, had a policy neural network trained to predict expert moves and a value neural network trained to predict the winners of games the policy network played against itself
  • The trained nets were combined with MCTS to provide lookahead search, using the policy net to narrow the search to high-probability moves and the value net (along with MC rollouts) to evaluate tree positions
  • This was followed by a similar model, AlphaGo Lee, which defeated Lee Sedol
  • AlphaGo Zero differs as follows:
    1. Trained solely by self-play RL, starting from random moves
    2. Uses only the black and white stones as input features
    3. Uses a single net rather than separate policy and value nets
    4. Uses a simpler tree search without MC rollouts
  • Achieves this via an RL algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise, stable learning

1. Reinforcement Learning in AlphaGo Zero

Page 3 (last paragraph continues into Page 5)

  • Neural net $f_\theta$ takes as input a raw board representation of the position $s$ and its history and outputs $(\mathbf{p}, v)$, where $p_a = \mathrm{Pr}(a\vert s)$ and $v$ estimates the probability of the current player winning from $s$.
  • In each position MCTS is executed with the help of the net and outputs new probabilities $\boldsymbol{\pi}$ for playing each move.
  • $\boldsymbol{\pi}$ tends to result in much stronger moves than the raw probabilities $\mathbf{p}$.
  • Can see MCTS as a policy improvement operator.
  • Self-play with search (using $\boldsymbol{\pi}$ to select each move and using the winner $z$ as a sample of the value) can be seen as a policy evaluation operator.
  • Main idea of RL algorithm is to use these search operators repeatedly in policy iteration whereby neural net parameters are updated to make $(\mathbf{p}, v)$ closely match $(\boldsymbol{\pi}, z)$.
  • The improved parameters are then used in the next iteration to make the search even stronger.
  • MCTS details (a minimal sketch in code follows this list):
    • Each search tree edge $(s, a)$ stores prior probability $P(s, a)$, visit count $N(s, a)$, action value $Q(s, a)$.
    • Simulation begins from the root state, iteratively selects moves that maximise upper confidence bound $Q(s, a) + U(s, a)$ where $U(s, a) \propto P(s, a) / (1 + N(s, a))$ until leaf node encountered.
    • Leaf node expanded and evaluated only once by the net to generate both the prior probabilities and evaluation, $(P(s', \cdot), V(s')) = f_\theta(s')$.
    • Updates
    \[N(s, a) \leftarrow N(s, a) + 1, \qquad Q(s, a) \leftarrow \frac{1}{N(s, a)} \sum_{s' \vert s, a \rightarrow s'} V(s')\]
    • $s, a \rightarrow s'$ means simulation eventually reached $s'$ after it took move $a$ from position $s$.
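
A minimal Python sketch of these per-edge statistics and update rules. The `evaluate` stand-in and the exploration constant `C_PUCT` are illustrative assumptions, and the sign alternation between players is omitted; this is not the paper's implementation.

```python
import math

C_PUCT = 1.5              # assumed exploration constant, not the paper's exact value
NUM_MOVES = 362           # 19x19 board points plus pass

def evaluate(state):
    """Stand-in for f_theta(s'): uniform priors and a neutral value estimate."""
    return [1.0 / NUM_MOVES] * NUM_MOVES, 0.0

class Edge:
    """Statistics stored on a search-tree edge (s, a)."""
    def __init__(self, prior):
        self.P = prior    # prior probability P(s, a) from the net
        self.N = 0        # visit count N(s, a)
        self.W = 0.0      # total value, so that Q(s, a) = W / N

    @property
    def Q(self):
        return self.W / self.N if self.N else 0.0

def select(edges):
    """Pick the action maximising Q(s, a) + U(s, a), with U ∝ P(s, a) / (1 + N(s, a))."""
    total_visits = sum(e.N for e in edges.values())
    def puct(e):
        return e.Q + C_PUCT * e.P * math.sqrt(total_visits) / (1 + e.N)
    return max(edges, key=lambda a: puct(edges[a]))

def expand(state):
    """Evaluate a leaf once with the net, storing priors on its outgoing edges."""
    priors, value = evaluate(state)
    return {a: Edge(p) for a, p in enumerate(priors)}, value

def backup(path, value):
    """Update every traversed edge: N += 1 and Q tracks the mean of V(s')."""
    for edge in path:
        edge.N += 1
        edge.W += value   # sign alternation between players omitted for brevity
```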

Page 4

Figure 1: Self-play reinforcement learning in AlphaGo Zero

Figure 1 from the paper

Part (a)

  • Program plays a game $s_1 \ldots s_T$ against itself
  • In each position $s_t$ the program executes MCTS using the latest version of the network $f_\theta$
  • Moves selected according to search probabilities computed by MCTS, $a_t \sim \boldsymbol{\pi}_t$.
  • Terminal position $s_T$ scored according to game rules to compute winner $z$.
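
A minimal Python sketch of this self-play loop. The helpers `legal_moves`, `apply_move`, `is_terminal`, `score` and `run_mcts` are illustrative stand-ins (not the paper's environment or search); the loop records one training example $(s_t, \boldsymbol{\pi}_t, z_t)$ per position, as described on page 5.

```python
import random

def legal_moves(state):
    """Placeholder move generator; a real implementation follows the Go rules."""
    return ["pass", "D4", "Q16"]

def apply_move(state, move):
    """Placeholder 'board': simply the tuple of moves played so far."""
    return state + (move,)

def is_terminal(state):
    """Simplified end condition: both players passed in succession."""
    return len(state) >= 2 and state[-1] == "pass" and state[-2] == "pass"

def score(state):
    """Placeholder scoring; the real winner z is determined by the rules of Go."""
    return 1

def run_mcts(state):
    """Placeholder for the search; returns search probabilities pi over moves."""
    moves = legal_moves(state)
    return {m: 1.0 / len(moves) for m in moves}

def play_one_game(max_length=400):   # assumed cap; the paper enforces a maximum length
    state, player = (), +1           # player: +1 = black, -1 = white
    history = []                     # (s_t, pi_t, player to move at t)
    for _ in range(max_length):
        pi = run_mcts(state)                             # search probabilities pi_t
        moves, probs = zip(*pi.items())
        move = random.choices(moves, weights=probs)[0]   # a_t ~ pi_t
        history.append((state, pi, player))
        state, player = apply_move(state, move), -player
        if is_terminal(state):
            break
    z = score(state)                 # outcome from black's perspective
    # each example stores z_t with its sign set by the player to move at step t
    return [(s, pi, z * p) for (s, pi, p) in history]
```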

Part (b)

  • NN takes as input the raw board position $s_t$ and outputs a vector $\mathbf{p}_t$, representing a probability distribution over moves, and $v_t$, a scalar representing the probability of the current player winning from $s_t$
  • NN parameters $\theta$ updated to maximise similarity of the policy vector $\mathbf{p}_t$ to search probabilities $\boldsymbol{\pi}_t$
  • Also to minimise error between predicted winner $v_t$ and game winner $z$.
  • New parameters are used in the next iteration of self-play, as in (a).
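
A sketch of this training objective as a plain NumPy forward computation. Gradient-based parameter updates are omitted, and the optional `weights` argument and L2 coefficient `c` are illustrative assumptions rather than the paper's training code.

```python
import numpy as np

def combined_loss(p, v, pi, z, weights=None, c=1e-4):
    """MSE between v and z, plus cross-entropy from search probs pi to net probs p."""
    p, pi = np.asarray(p, dtype=float), np.asarray(pi, dtype=float)
    value_loss = (z - v) ** 2                        # (z - v)^2
    policy_loss = -np.sum(pi * np.log(p + 1e-12))    # -pi . log p
    reg = c * np.sum(weights ** 2) if weights is not None else 0.0   # L2 term
    return value_loss + policy_loss + reg

# when p == pi the policy term reduces to the entropy of pi, its minimum for a fixed pi
```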

Page 5

  • Can view MCTS as a self-play algorithm that, given $\theta$ and a root position $s$, computes a vector of search probabilities recommending moves, $\boldsymbol{\pi} = \alpha_\theta(s)$, where $\pi_a \propto N(s, a)^{1/\tau}$, i.e. proportional to the exponentiated visit count of each move for some temperature parameter $\tau$.
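
A small NumPy sketch of this temperature-based conversion from root visit counts to $\boldsymbol{\pi}$; the $\tau = 0$ branch is the limit case of greedily playing the most-visited move.

```python
import numpy as np

def search_probabilities(visit_counts, tau=1.0):
    """pi_a ∝ N(s, a)^(1/tau); smaller tau sharpens pi towards the most-visited move."""
    counts = np.asarray(visit_counts, dtype=float)
    if tau == 0:                         # limit case: deterministically pick argmax N
        pi = np.zeros_like(counts)
        pi[np.argmax(counts)] = 1.0
        return pi
    scaled = counts ** (1.0 / tau)
    return scaled / scaled.sum()

# e.g. visit counts [10, 30, 60] give pi = [0.1, 0.3, 0.6] at tau = 1,
# and roughly [0.02, 0.20, 0.78] at tau = 0.5 (counts squared, then normalised)
```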

  • NN trained by self-play RL algorithm that uses MCTS to play each move:
  • Init to random $\theta_0$
  • At each iteration $i$ games of self-play generated
  • At each time step $t$ an MCTS search $\boldsymbol{\pi}_t$ is executed using the previous iteration of the NN, $f_{\theta_{i-1}}$, and a move is sampled from $\boldsymbol{\pi}_t$.
  • Game terminates at step $T$ when:
    • Both players pass; or
    • Search value <= resignation threshold; or
    • Exceeds max length
  • The game is then scored to give a final reward $r_T=\pm 1$
  • Data for each time step: $(s_t, \boldsymbol{\pi}_t, z_t)$ where sign of $z_t$ depends on whether current player is winner.
  • New parameters $\theta_i$ trained from data $(s, \boldsymbol{\pi}, z)$ sampled uniformly among all time-steps of the last iteration of self-play.
  • NN adjusted to minimise the error between the predicted value $v$ and the self-play winner $z$ (using MSE) and between the NN move probabilities $\mathbf{p}$ and the search probabilities $\boldsymbol{\pi}$ (using CE loss).
  • The total loss is the sum of these two losses plus a regularisation term (written out below)
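
Written out, the combined loss from the paper is

\[l = (z - v)^2 - \boldsymbol{\pi}^{\top} \log \mathbf{p} + c \lVert \theta \rVert^2\]

where $c$ controls the level of L2 weight regularisation.
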
Figure 2: Monte-Carlo tree search in AlphaGo Zero

Figure 2 from the paper

  • Steps (a), (b) and (c) are repeated for each simulation

Part (a) Select

  • Each simulation traverses the tree by selecting the edge with maximum $Q + U$ (action value plus upper confidence bound)
  • $U$ depends on the stored prior probability $P$ and visit count $N$ for the edge, which is incremented each time the edge is traversed

Part (b) Expand and evaluate

  • Leaf node is expanded and NN is used to evaluate $P(s,\cdot) , V(s)$ for the associated position $s$
  • Outgoing edges from $s$ store $P$ which is a vector.

Part (c) Backup

  • Action values $Q$ updated to track mean of all evaluations $V$ in subtree below that action

Part (d) Play

  • Search probs $\boldsymbol{\pi}$ are returned proportional to $N^{1/\tau}$:
    • $N$: visit count of each move from root state
    • $\tau$: parameter controlling temperature

2. Empirical Analysis of AlphaGo Zero Training

Page 6

  • 3 days of training involving 4.9 million self-play games, using 1,600 simulations for each MCTS search (~0.4 s of thinking time per move)
  • 700k mini-batches of 2,048 positions were used to update $\theta$

Page 7

Figure 3: Empirical evaluation of AlphaGo Zero

Figure 3 from the paper

Part (a)

  • RL versus supervised learning from human data. Smooth progression of learning - no oscillations or catastrophic forgetting
  • AlphaGo Zero outperformed AlphaGo Lee after just 36 hours; the latter was trained over several months
  • Using a single machine with 4 TPUs, the fully trained model defeated AlphaGo Lee by 100 games to 0.

Part (b), (c)

  • Supervised learning is better at predicting human professional moves and performs better initially, but after about 24 hours RL takes over

Page 9

Figure 4: Comparison of neural network architectures in AlphaGo Zero and AlphaGo Lee

Figure 4 from the paper

  • Four different models that used either separate or combined (“dual”) policy and value networks and either residual or convolutional architectures
  • Each net trained to min same loss function using a fixed dataset of self-play games generated by AlphaGo Zero after 72 hours of self-play training.

Part (a)

  • Each model was combined with AlphaGo Zero’s search to obtain a different player
  • Elo ratings were computed from eval games between the different players - 5s thinking time per move

Part (b), (c)

  • Residual nets were more accurate at predicting human moves and had lower error when predicting professional game outcomes.

3. Knowledge Learned by AlphaGo Zero

Page 10

  • AlphaGo Zero managed to discover existing corner sequences (joseki) but ultimately preferred ones that were previously unknown
  • Starting from entirely random play, AlphaGo Zero rapidly progressed towards a sophisticated understanding of Go concepts
  • But one of the first elements learned by humans, shicho (ladder capture sequences), was only understood by AlphaGo Zero much later in training

4. Final Performance of AlphaGo Zero

Page 10

  • Larger NN and longer duration
  • Trained from scratch for ~40 days
  • 29 million games of self-play were generated and the weights were updated from 3.1 million mini-batches of 2,048 positions each.
  • NN had 40 res blocks

Page 12

  • Overtook AlphaGo Lee by a large margin after a few days, but took over a month to overtake AlphaGo Master, and only by a small margin (Figure 6a)
  • Evaluated using internal tournament against the following models, all of which it outperformed:
    • AlphaGo Fan (defeated Fan Hui)
    • AlphaGo Lee (defeated Lee Sedol)
    • AlphaGo Master (based on same algorithm and architecture but using human data and features, which defeated top human professionals 60-0 in online games)
    • Raw version of AlphaGo Zero which selects move $a$ with max probability $p_a$ without MCTS
    • Previous Go programs
  • AlphaGo Zero won 89 games to 11 when evaluated head to head against AlphaGo Master in a 100 game match with 2 hour time controls

5. Conclusion

Paper results have demonstrated:

  • Pure RL approach is fully feasible even in the most challenging of domains
  • Possible to train to superhuman level without human supervision beyond basic rules about domain
  • Trains faster and gets much better asymptotic performance compared to training on expert data