
GoGoGo: Improving Deep Neural Network Based Go Playing AI with Residual Networks

Xingyu Liu

Network Architecture

Fig. 2 (a) Policy Network: Input (19x19x5) passed through a stack of CONV3, 64 layers to the move-probability output P.

(b) Value Network: Input (19x19x5) passed through a stack of CONV3, 64 layers to a single scalar evaluation.
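The two panels above can be read as plain convolutional stacks. Below is a minimal PyTorch sketch of both networks, assuming 3x3 convolutions with 64 channels throughout, a 19x19x5 input, per-point move logits for the policy output P, and the FC 23104 → 1 layer listed in the architecture panel as the value head. The exact number of layers, the 1x1 policy head, and the tanh value activation are illustrative assumptions, not details taken from the poster.

```python
import torch
import torch.nn as nn

BOARD = 19          # board side length
IN_CHANNELS = 5     # the five input feature planes described in the poster
WIDTH = 64          # CONV3, 64 throughout


def conv_stack(num_layers: int) -> nn.Sequential:
    """A plain stack of 3x3, 64-channel convolutions with ReLU."""
    layers = []
    in_ch = IN_CHANNELS
    for _ in range(num_layers):
        layers += [nn.Conv2d(in_ch, WIDTH, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        in_ch = WIDTH
    return nn.Sequential(*layers)


class PolicyNet(nn.Module):
    """Fig. 2 (a): conv stack -> one logit per board point (softmax applied by the loss/caller)."""
    def __init__(self, num_layers: int = 13):            # layer count is an assumption
        super().__init__()
        self.trunk = conv_stack(num_layers)
        self.head = nn.Conv2d(WIDTH, 1, kernel_size=1)    # assumed 1x1 head producing P
    def forward(self, x):                                 # x: (N, 5, 19, 19)
        return self.head(self.trunk(x)).flatten(1)        # (N, 361) move logits


class ValueNet(nn.Module):
    """Fig. 2 (b): conv stack -> FC 23104 -> 1 scalar evaluation."""
    def __init__(self, num_layers: int = 13):
        super().__init__()
        self.trunk = conv_stack(num_layers)
        self.fc = nn.Linear(WIDTH * BOARD * BOARD, 1)     # 64 * 19 * 19 = 23104 -> 1
    def forward(self, x):
        return torch.tanh(self.fc(self.trunk(x).flatten(1)))   # tanh output is an assumption


if __name__ == "__main__":
    x = torch.randn(2, IN_CHANNELS, BOARD, BOARD)
    print(PolicyNet()(x).shape, ValueNet()(x).shape)      # (2, 361) and (2, 1)
```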

Introduction

• Go-playing AIs using traditional search: GNU Go, Pachi, Fuego, Zen, etc.

Training Methodology and Data

• From Kifu to Input Feature Maps: channels are 1) space (empty) positions; 2) black positions; 3) white positions; 4) current player; 5) ko positions
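A minimal sketch of how one position taken from a kifu might be turned into the 19x19x5 input planes listed above. The board encoding (0 empty, 1 black, 2 white), the plane order, and the function name are assumptions made for illustration, not the project's actual preprocessing code.

```python
import numpy as np

EMPTY, BLACK, WHITE = 0, 1, 2   # assumed board encoding


def to_feature_planes(board: np.ndarray, to_play: int, ko_point=None) -> np.ndarray:
    """Encode one position as a 19x19x5 tensor: empty / black / white / current player / ko.

    board    -- (19, 19) array of EMPTY / BLACK / WHITE
    to_play  -- BLACK or WHITE, the player to move
    ko_point -- (row, col) of the point forbidden by ko, or None
    """
    planes = np.zeros((19, 19, 5), dtype=np.float32)
    planes[:, :, 0] = (board == EMPTY)                     # 1) space (empty) positions
    planes[:, :, 1] = (board == BLACK)                     # 2) black positions
    planes[:, :, 2] = (board == WHITE)                     # 3) white positions
    planes[:, :, 3] = 1.0 if to_play == BLACK else 0.0     # 4) current player (constant plane)
    if ko_point is not None:                               # 5) ko positions
        planes[ko_point[0], ko_point[1], 4] = 1.0
    return planes


# Example: empty board, black to move, no ko
features = to_feature_planes(np.zeros((19, 19), dtype=np.int8), BLACK, None)
print(features.shape)   # (19, 19, 5)
```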

References

[1] David Silver et al., “Mastering the game of Go with deep neural networks and tree search”, Nature, 529:484–489, 2016.

[2] Kaiming He et al., “Deep residual learning for image recognition”, CoRR, abs/1512.03385, 2015.

Training pipeline: SL on Policy Network → RL on Policy Network → RL on Value Network
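The poster does not spell out the reinforcement-learning procedure; as a hedged illustration of the "RL on Policy Network" stage, the sketch below uses an AlphaGo-style REINFORCE update [1], where the outcome of a finished self-play game rewards or penalizes the moves that were played. The network class, reward convention, and optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def reinforce_update(policy_net, optimizer, states, actions, outcome):
    """One REINFORCE-style update from a finished self-play game.

    states  -- (T, 5, 19, 19) tensor of input feature planes for the learner's moves
    actions -- (T,) tensor of chosen move indices in [0, 361)
    outcome -- +1.0 if the learner won the game, -1.0 if it lost
    """
    logits = policy_net(states)                            # (T, 361) move logits
    log_probs = F.log_softmax(logits, dim=1)               # log pi(a | s)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(outcome * chosen).mean()                      # reinforce winning moves, penalize losing ones
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    # Stand-in linear policy just to exercise the update
    net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(5 * 19 * 19, 361))
    opt = torch.optim.SGD(net.parameters(), lr=1e-3)
    s = torch.randn(10, 5, 19, 19)
    a = torch.randint(0, 361, (10,))
    print(reinforce_update(net, opt, s, a, outcome=1.0))
```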

• Dynamic Board State Expansion: handles ko fights explicitly (Fig. 1), saves disk space, and keeps memory usage small (these data-handling steps are sketched in code after this list)

• Use Ing Chang-ki rules: board state plus ko is the full game state, so there is no need to track the number of captured stones

• Two Levels of Batches (kifus, then moves): random shuffling at both levels keeps memory usage small while preserving locality

• Powered by Deep Learning: Zen → Deep Zen Go, darkforest, AlphaGo

• Goal: move from a vanilla CNN to ResNets
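The data-handling bullets above (dynamic expansion, the reduced game state under Ing rules, and two-level batching) fit together as a small pipeline. Below is a minimal Python sketch of that pipeline under simplifying assumptions: capture, suicide, and ko detection are omitted from the replay, the class/function names are invented for illustration, and the group and batch sizes are arbitrary.

```python
import random
from dataclasses import dataclass
from typing import Iterator, List, Optional, Sequence, Tuple
import numpy as np

EMPTY, BLACK, WHITE = 0, 1, 2


@dataclass
class GameState:
    """Under the Ing Chang-ki (area-scoring) rules, board + ko point is the whole
    game state: no captured-stone counter has to be carried along."""
    board: np.ndarray                        # (19, 19) array of EMPTY/BLACK/WHITE
    to_play: int
    ko_point: Optional[Tuple[int, int]] = None


def expand_states(moves: Sequence[Tuple[int, int]]) -> Iterator[Tuple[GameState, Tuple[int, int]]]:
    """Dynamic board state expansion: only the kifu's move list is stored; every
    intermediate position (including ko-fight positions, cf. Fig. 1) is rebuilt on
    the fly and discarded after use. Captures and ko detection are omitted here."""
    board = np.zeros((19, 19), dtype=np.int8)
    color = BLACK
    for move in moves:
        yield GameState(board.copy(), color), move     # (position, expert move) training pair
        board[move] = color
        color = WHITE if color == BLACK else BLACK


def two_level_batches(kifus: Sequence[Sequence[Tuple[int, int]]],
                      kifus_per_group: int = 8,
                      batch_size: int = 32) -> Iterator[List[Tuple[GameState, Tuple[int, int]]]]:
    """Two levels of batches: shuffle kifus, expand a small group at a time (small
    memory footprint, good locality), then shuffle that group's moves before
    cutting mini-batches."""
    order = list(range(len(kifus)))
    random.shuffle(order)                                   # level 1: kifu shuffling
    for start in range(0, len(order), kifus_per_group):
        group = order[start:start + kifus_per_group]
        examples = [pair for i in group for pair in expand_states(kifus[i])]
        random.shuffle(examples)                            # level 2: move shuffling
        for b in range(0, len(examples), batch_size):
            yield examples[b:b + batch_size]


# Example with two tiny fake kifus
kifus = [[(3, 3), (15, 15), (3, 15)], [(16, 3), (3, 16)]]
for batch in two_level_batches(kifus, kifus_per_group=2, batch_size=2):
    print(len(batch))
```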

(c) Residual Module: Data → CONV3 → Batch Norm → ReLU → CONV3 → Batch Norm → ReLU → element-wise add with the input.

FC 23104 → 1 (final fully connected layer of the value network)
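A minimal PyTorch sketch of the residual module in Fig. 2 (c): two CONV3 + Batch Norm + ReLU stages whose output is added element-wise to the incoming data, following [2]. The channel count (64) matches the surrounding stacks; placing the second ReLU before the addition mirrors the order drawn in the panel rather than the post-add ReLU of the original ResNet paper.

```python
import torch
import torch.nn as nn


class ResidualModule(nn.Module):
    """Fig. 2 (c): CONV3 -> BN -> ReLU -> CONV3 -> BN -> ReLU, then element-wise add with the input."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),          # drawn before the add in the poster's panel
        )

    def forward(self, x):
        return x + self.branch(x)           # eltwise add: identity skip connection


if __name__ == "__main__":
    x = torch.randn(1, 64, 19, 19)
    print(ResidualModule()(x).shape)        # torch.Size([1, 64, 19, 19])
```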

Experiment Result

Hyperparameter        Value
Base learning rate    2e-4
Decay policy          Exponential
Decay rate            0.95
Decay step (kifus)    200
Loss function         Softmax (cross-entropy)

(b) Hyperparameters
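The hyperparameters above map onto a standard supervised training loop. The sketch below wires them up with PyTorch's ExponentialLR scheduler (decay rate 0.95, stepped once per 200 kifus) and softmax cross-entropy on the expert move; the optimizer choice (plain SGD) and the stand-in model are assumptions, not details from the poster.

```python
import torch
import torch.nn as nn

# Values from the hyperparameter table
BASE_LR = 2e-4
DECAY_RATE = 0.95
DECAY_STEP_KIFUS = 200

model = nn.Sequential(nn.Flatten(), nn.Linear(5 * 19 * 19, 361))   # stand-in for the policy network
optimizer = torch.optim.SGD(model.parameters(), lr=BASE_LR)        # optimizer choice is an assumption
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=DECAY_RATE)
criterion = nn.CrossEntropyLoss()                                  # softmax loss on the expert move

kifus_seen = 0


def train_batch(states: torch.Tensor, expert_moves: torch.Tensor, kifus_in_batch: int) -> float:
    """One SL step: predict the expert move, decay the learning rate every 200 kifus."""
    global kifus_seen
    optimizer.zero_grad()
    loss = criterion(model(states), expert_moves)
    loss.backward()
    optimizer.step()
    kifus_seen += kifus_in_batch
    if kifus_seen >= DECAY_STEP_KIFUS:                              # exponential decay per 200 kifus
        scheduler.step()
        kifus_seen -= DECAY_STEP_KIFUS
    return loss.item()


# Example call with random data
print(train_batch(torch.randn(32, 5, 19, 19), torch.randint(0, 361, (32,)), kifus_in_batch=4))
```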

Fig. 1 Explicit expansion of a ko fight

• Training Accuracy ~ 32%

• Testing Accuracy ~ 26%

Fig. 3 GoGoGo plays against itself, policy network only
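Fig. 3 was produced by letting the policy network alone choose both sides' moves. A hedged sketch of such a self-play loop follows: each move is sampled from the softmaxed policy output with occupied points masked out. Legality checks beyond occupancy (captures, suicide, ko) and the stopping rule are simplified assumptions, and it pairs with the feature-plane encoder sketched earlier.

```python
import numpy as np
import torch


def self_play(policy_net, encode, max_moves: int = 400):
    """Play a game in which the policy network alone chooses every move.

    policy_net -- maps a (1, 5, 19, 19) tensor to 361 move logits
    encode     -- maps (board, to_play, ko_point) to a (19, 19, 5) float32 array
    """
    board = np.zeros((19, 19), dtype=np.int8)
    to_play, moves = 1, []
    for _ in range(max_moves):
        x = torch.from_numpy(encode(board, to_play, None)).permute(2, 0, 1).unsqueeze(0)
        with torch.no_grad():
            probs = torch.softmax(policy_net(x)[0], dim=0).numpy()
        probs[board.reshape(-1) != 0] = 0.0            # crude legality: only empty points
        if probs.sum() == 0:
            break
        move = int(np.random.choice(361, p=probs / probs.sum()))
        board[move // 19, move % 19] = to_play          # captures omitted in this sketch
        moves.append(move)
        to_play = 2 if to_play == 1 else 1
    return moves


if __name__ == "__main__":
    net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(5 * 19 * 19, 361))
    enc = lambda board, to_play, ko: np.zeros((19, 19, 5), dtype=np.float32)
    print(len(self_play(net, enc, max_moves=10)))
```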

Supervised Learning Training Loss (plotted every 100 batches)

Future Work

• Monte Carlo Tree Search (a sketch of this selection rule follows the list below):

  a_t = argmax_a [ Q(s_t, a) + u(s_t, a) ],  where u(s, a) ∝ P(s, a) / (1 + N(s, a))

• Reinforcement Learning of Value Network

• Network Architecture Exploration

• Real Match Testing against Human Players
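The MCTS selection rule listed above (choose the action maximizing Q plus an exploration bonus proportional to P(s, a) / (1 + N(s, a))) can be written directly. A minimal sketch with plain arrays for Q, P, and N; the poster only states proportionality, so the constant c_puct is an assumed parameter.

```python
import numpy as np


def select_action(Q: np.ndarray, P: np.ndarray, N: np.ndarray, c_puct: float = 5.0) -> int:
    """In-tree selection: a_t = argmax_a [ Q(s_t, a) + u(s_t, a) ],
    with u(s, a) = c_puct * P(s, a) / (1 + N(s, a)).

    Q -- mean action values, P -- policy-network priors, N -- visit counts (all length 361)
    c_puct -- exploration constant (assumed; the poster only gives the proportionality)
    """
    u = c_puct * P / (1.0 + N)
    return int(np.argmax(Q + u))


# Example: with no visits yet, the bonus follows the priors
Q = np.zeros(361)
N = np.zeros(361)
P = np.full(361, 1.0 / 361)
P[72] = 0.05                      # prior peaked at one point
print(select_action(Q, P, N))     # 72
```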
