

GoGoGo: Improving Deep Neural Network Based Go Playing AI with Residual Networks

Xingyu Liu

Introduction

• Go-playing AIs using traditional search: GNU Go, Pachi, Fuego, Zen, etc.
• Go-playing AIs powered by deep learning: Zen → Deep Zen Go, darkforest, AlphaGo
• Goal: move from a vanilla CNN to ResNets

Network Architecture

Fig. 2 (a) Policy Network: Input (19x19x5) → stack of CONV3, 64 layers (twelve as drawn) → final CONV3 layer → output P (move probabilities over the board).

(b) Value Network: Input (19x19x5) → stack of CONV3, 64 layers (twelve as drawn) → CONV3, 64 → FC (23104 → 1, i.e. 19x19x64 flattened to a single scalar) → position value.
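As a rough sketch of the plain policy network in Fig. 2(a) (before the planned switch to ResNets), the following PyTorch code stacks 3x3, 64-filter convolutions on the 19x19x5 input and ends with a per-board softmax P. Only the layer sizes come from the figure; the module and variable names, and the single-plane output head, are assumptions rather than the author's implementation.

# Minimal sketch of the Fig. 2(a) policy network (assumed PyTorch version).
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, in_channels=5, width=64, num_layers=12):
        super().__init__()
        layers = []
        c = in_channels
        for _ in range(num_layers):
            layers += [nn.Conv2d(c, width, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
            c = width
        self.trunk = nn.Sequential(*layers)
        # Final 3x3 convolution collapses the 64 channels to a single plane of move logits.
        self.head = nn.Conv2d(width, 1, kernel_size=3, padding=1)

    def forward(self, x):                       # x: (batch, 5, 19, 19)
        logits = self.head(self.trunk(x))       # (batch, 1, 19, 19)
        logits = logits.flatten(1)              # (batch, 361)
        return torch.softmax(logits, dim=1)     # P: move probabilities over the board

# Example: one forward pass on a random batch of board encodings.
p = PolicyNet()(torch.randn(2, 5, 19, 19))
print(p.shape, p.sum(dim=1))                    # torch.Size([2, 361]); each row sums to ~1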



Training Methodology and Data

• From kifu to input feature maps, 5 channels: 1) empty (space) positions; 2) black stone positions; 3) white stone positions; 4) current player; 5) ko positions (see the encoding sketch below).
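A minimal sketch of how one position could be encoded into the five 19x19 feature planes listed above; the plane order follows the list, while the board representation and function name are illustrative assumptions.

# Hypothetical encoder: board is a 19x19 array with 0 = empty, 1 = black, 2 = white.
import numpy as np

def encode_position(board, to_play, ko_point=None):
    """Return a 19x19x5 float32 feature tensor for one position."""
    planes = np.zeros((19, 19, 5), dtype=np.float32)
    planes[:, :, 0] = (board == 0)                    # 1) empty positions
    planes[:, :, 1] = (board == 1)                    # 2) black stone positions
    planes[:, :, 2] = (board == 2)                    # 3) white stone positions
    planes[:, :, 3] = 1.0 if to_play == 1 else 0.0    # 4) current player (all-ones plane if black to play)
    if ko_point is not None:                          # 5) ko position (illegal recapture point, if any)
        planes[ko_point[0], ko_point[1], 4] = 1.0
    return planes

features = encode_position(np.zeros((19, 19), dtype=np.int8), to_play=1)
print(features.shape)                                 # (19, 19, 5)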

[1] David Silver et al., “Mastering the game of Go with deep neural networks and tree search”, Nature, 529:484–489, 2016.
[2] Kaiming He et al., “Deep residual learning for image recognition”, CoRR, abs/1512.03385, 2015.

• Training pipeline: SL on policy network → RL on policy network → RL on value network (a supervised-learning step is sketched below).
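For the first stage (SL on the policy network), a hedged sketch of one training step using the softmax loss and base learning rate from the hyperparameter table below; it reuses the PolicyNet sketch above, and everything not stated on the poster is assumed.

# Assumed supervised-learning step: predict the expert move from the encoded position.
import torch
import torch.nn.functional as F

def sl_step(policy_net, optimizer, boards, expert_moves):
    """boards: (B, 5, 19, 19) float tensor; expert_moves: (B,) move indices in [0, 361)."""
    probs = policy_net(boards)                                  # (B, 361) from the PolicyNet sketch above
    loss = F.nll_loss(torch.log(probs + 1e-12), expert_moves)   # softmax (cross-entropy) loss on expert moves
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(policy_net.parameters(), lr=2e-4)  # base learning rate from the table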

• Dynamic board state expansion: board states are expanded on the fly (with ko fights handled explicitly, Fig. 1); this saves disk space and keeps memory use small (see the sketch below).
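A sketch of the dynamic expansion idea under the assumption that only the move list of each kifu is stored and board states are replayed on demand; go_replay_move is a hypothetical stand-in for the actual Go rules engine.

# Sketch: regenerate board states on the fly from the stored move list of a kifu.
def expand_states(moves, go_replay_move):
    """moves: list of (color, row, col); go_replay_move applies one move and
    returns (new_board, ko_point or None) after captures are resolved."""
    board = [[0] * 19 for _ in range(19)]   # 0 = empty
    ko = None
    for color, row, col in moves:
        yield board, ko, (color, row, col)  # training example: state before the move, move as label
        board, ko = go_replay_move(board, (color, row, col))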

• Use the Ing Chang-ki rules: board state + ko is the full game state, so there is no need to remember the number of captured stones.

• Two levels of batches (kifus, then moves) with random shuffling: memory usage stays small and locality is preserved (see the sketch below).
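A hedged sketch of the two-level batching scheme: shuffle kifus, load a small chunk of them, then shuffle and batch the moves inside that chunk so memory stays small; chunk and batch sizes here are illustrative, not the values actually used.

# Sketch: two levels of batches (kifus, then moves) with random shuffling.
import random

def move_batches(kifus, load_moves, kifus_per_chunk=32, batch_size=128):
    """kifus: list of kifu file paths; load_moves(path) -> list of (features, label)."""
    random.shuffle(kifus)                                   # level 1: shuffle whole games
    for i in range(0, len(kifus), kifus_per_chunk):
        chunk = []
        for path in kifus[i:i + kifus_per_chunk]:
            chunk.extend(load_moves(path))                  # only this chunk is held in memory
        random.shuffle(chunk)                               # level 2: shuffle moves within the chunk
        for j in range(0, len(chunk), batch_size):
            yield chunk[j:j + batch_size]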


(c) Residual Module: two [CONV3 → Batch Norm → ReLU] stages whose output is combined with the input data by an element-wise add (skip connection).
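A PyTorch sketch of the residual module in Fig. 2(c): two CONV3 + Batch Norm + ReLU stages added element-wise to the input. The placement of the second ReLU follows the figure as drawn and may differ from the eventual implementation.

# Sketch of the residual module in Fig. 2(c) (assumed PyTorch version).
import torch.nn as nn

class ResidualModule(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.body(x)     # element-wise add (skip connection) with the input data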


Hyperparameter        Value
Base learning rate    2E-4
Decay policy          Exponential
Decay rate            0.95
Decay step (kifus)    200
Loss function         Softmax

(b) Hyperparameters
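From the table, the learning rate decays exponentially by a factor of 0.95 every 200 kifus starting from 2E-4; a small sketch of that schedule (function name illustrative):

# Exponential learning-rate decay from the hyperparameter table.
def learning_rate(kifus_seen, base_lr=2e-4, decay_rate=0.95, decay_step=200):
    return base_lr * decay_rate ** (kifus_seen // decay_step)

print(learning_rate(0), learning_rate(200), learning_rate(2000))   # 2e-4, 1.9e-4, ~1.2e-4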

Fig. 1 Explicit expansion of a ko fight

Experiment Result

• Training Accuracy ~ 32%
• Testing Accuracy ~ 26%

Fig. 3 GoGoGo plays against itself, policy network only

[Plot: Supervised Learning Training Loss, recorded per 100 batches; loss axis 0 to 7, horizontal axis up to ~8,700.]

Future Work

• Monte Carlo Tree Search: select moves by $a_t = \arg\max_a \big( Q(s_t, a) + u(s_t, a) \big)$, with the exploration bonus $u(s, a) \propto \frac{P(s, a)}{1 + N(s, a)}$ (a minimal selection-rule sketch follows this list).

• Reinforcement Learning of Value Network

• Network Architecture Exploration

• Real Match Testing against Human Players
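As a companion to the MCTS selection rule listed under Future Work, a minimal sketch of choosing a_t = argmax_a (Q + u) with u proportional to P / (1 + N); the dictionary-based node representation and the c_puct constant are assumptions, and no expansion, rollout, or backup logic is included.

# Sketch of the planned MCTS selection step: a_t = argmax_a (Q(s,a) + u(s,a)),
# with exploration bonus u(s,a) proportional to P(s,a) / (1 + N(s,a)).
def select_action(Q, N, P, c_puct=1.0):
    """Q, N, P: dicts mapping legal actions to value estimate, visit count, and prior."""
    def score(a):
        u = c_puct * P[a] / (1.0 + N[a])
        return Q[a] + u
    return max(P, key=score)

# Example with three candidate moves; move 1 has the highest Q + u and is selected.
print(select_action(Q={0: 0.1, 1: 0.3, 2: 0.2},
                    N={0: 10, 1: 50, 2: 5},
                    P={0: 0.5, 1: 0.3, 2: 0.2}))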