
Page 1

Lecture 15: MCTS [1]

Emma Brunskill

CS234 Reinforcement Learning.

Winter 2021

[1] With many slides from or derived from David Silver

Page 2

Refresh Your Understanding: Batch RL

Select all that are true:
1. Batch RL refers to when we have many agents acting in a batch
2. In batch RL we generally care more about sample efficiency than computational efficiency
3. Importance sampling can be used to get an unbiased estimate of policy performance
4. Q-learning can be used in batch RL and will generally provide a better estimate than importance sampling in Markov environments for any function approximator used for the Q
5. Not sure

Page 3

Refresh Your Understanding: Batch RL Solutions

Select all that are true:
1. Batch RL refers to when we have many agents acting in a batch
2. In batch RL we generally care more about sample efficiency than computational efficiency
3. Importance sampling can be used to get an unbiased estimate of policy performance
4. Q-learning can be used in batch RL and will generally provide a better estimate than importance sampling in Markov environments for any function approximator used for the Q
5. Not sure

Answer: F, T, T, F.

Page 4

Class Structure

Last time: Professor Doshi-Velez guest lecture

This Time: MCTS

Next time: Last lecture

Page 5

Video

AlphaGo video

Page 6

Monte Carlo Tree Search

Why choose to cover this as well?

Responsible in part for one of the greatest achievements in AI in the last decade: becoming a better Go player than any human

Brings in ideas of model-based RL and the benefits of planning

Page 7

Table of Contents

1 Introduction

2 Model-Based Reinforcement Learning

3 Simulation-Based Search

Page 8

Introduction: Model-Based Reinforcement Learning

Previous lectures: For online learning, learn value function or policy directly from experience

This lecture: For online learning, learn model directly from experience and use planning to construct a value function or policy

Integrate learning and planning into a single architecture

Page 9

Model-Based and Model-Free RL

Model-Free RL

No model
Learn value function (and/or policy) from experience

Page 10

Model-Based and Model-Free RL

Model-Free RL

No model
Learn value function (and/or policy) from experience

Model-Based RL

Learn a model from experience
Plan value function (and/or policy) from model

Page 11

Model-Free RL

Page 12

Model-Based RL

Page 13

Table of Contents

1 Introduction

2 Model-Based Reinforcement Learning

3 Simulation-Based Search

Page 14

Model-Based RL

Page 15

Advantages of Model-Based RL

Advantages:

Can efficiently learn model by supervised learning methods
Can reason about model uncertainty (like in upper confidence bound methods for exploration/exploitation trade-offs)

Disadvantages:

First learn a model, then construct a value function ⇒ two sources of approximation error

Page 16

MDP Model Refresher

A model M is a representation of an MDP <S, A, P, R>, parametrized by η

We will assume state space S and action space A are known

So a model M = <P_η, R_η> represents state transitions P_η ≈ P and rewards R_η ≈ R

S_{t+1} \sim P_\eta(S_{t+1} | S_t, A_t)

R_{t+1} = R_\eta(R_{t+1} | S_t, A_t)

Typically assume conditional independence between state transitions and rewards:

P[S_{t+1}, R_{t+1} | S_t, A_t] = P[S_{t+1} | S_t, A_t] \, P[R_{t+1} | S_t, A_t]

Page 17

Model Learning

Goal: estimate model M_η from experience {S_1, A_1, R_2, ..., S_T}

This is a supervised learning problem:

S_1, A_1 → R_2, S_2
S_2, A_2 → R_3, S_3
...
S_{T-1}, A_{T-1} → R_T, S_T

Learning s, a → r is a regression problem

Learning s, a → s' is a density estimation problem

Pick a loss function, e.g. mean-squared error, KL divergence, ...

Find parameters η that minimize empirical loss
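
To make the supervised-learning framing concrete, here is a minimal sketch (my own illustration, not from the slides): it forms (s, a) → r training pairs from a hypothetical trajectory and fits the reward model by least squares on one-hot (s, a) features; the s, a → s' part would analogously be fit as a density estimator (for example by counting, as on the next slide).

```python
import numpy as np

# Hypothetical data, purely for illustration: (s, a, r, s') tuples from one
# trajectory S1, A1, R2, S2, ... with small integer states and actions.
transitions = [(0, 1, 0.0, 1), (1, 0, 1.0, 2), (2, 1, 0.5, 0), (0, 1, 0.2, 1)]
n_states, n_actions = 3, 2

def one_hot_sa(s, a):
    """One-hot feature vector for a state-action pair."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

X = np.array([one_hot_sa(s, a) for (s, a, r, s_next) in transitions])
y = np.array([r for (s, a, r, s_next) in transitions])

# Least-squares fit of eta for a reward model R_eta(s, a) = x(s, a) . eta,
# i.e. minimizing the empirical mean-squared error.
eta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(eta.reshape(n_states, n_actions))   # estimated mean reward per (s, a)
```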

Page 18

Examples of Models

Table Lookup Model

Linear Expectation Model

Linear Gaussian Model

Gaussian Process Model

Deep Belief Network Model

. . .

Page 19

Table Lookup Model

Model is an explicit MDP, P̂, R̂

Count visits N(s, a) to each state-action pair:

\hat{P}^a_{s,s'} = \frac{1}{N(s, a)} \sum_{t=1}^{T} 1(S_t, A_t, S_{t+1} = s, a, s')

\hat{R}^a_s = \frac{1}{N(s, a)} \sum_{t=1}^{T} 1(S_t, A_t = s, a) R_{t+1}

Alternatively:

At each time-step t, record experience tuple <S_t, A_t, R_{t+1}, S_{t+1}>
To sample the model, randomly pick a tuple matching <s, a, ·, ·>
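
A minimal sketch of such a table-lookup model (my own illustration; integer-encoded states and actions are assumed): it accumulates the counts N(s, a) and reward sums from experience tuples and exposes the estimates P̂, R̂ plus a sample() method that sample-based planning can call later.

```python
import random
from collections import defaultdict

class TableLookupModel:
    """Count-based MLE model: P_hat(s' | s, a) and R_hat(s, a)."""

    def __init__(self):
        self.n_sa = defaultdict(int)      # N(s, a)
        self.n_sas = defaultdict(int)     # N(s, a, s')
        self.r_sum = defaultdict(float)   # sum of rewards observed after (s, a)

    def update(self, s, a, r, s_next):
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s_next)] += 1
        self.r_sum[(s, a)] += r

    def p_hat(self, s, a, s_next):
        n = self.n_sa[(s, a)]
        return self.n_sas[(s, a, s_next)] / n if n else 0.0

    def r_hat(self, s, a):
        n = self.n_sa[(s, a)]
        return self.r_sum[(s, a)] / n if n else 0.0

    def sample(self, s, a):
        """Sample (r, s') from the estimated model.

        Assumes (s, a) has been visited at least once in the data.
        """
        successors = [(sp, c) for (ss, aa, sp), c in self.n_sas.items()
                      if (ss, aa) == (s, a)]
        states, counts = zip(*successors)
        s_next = random.choices(states, weights=counts)[0]
        return self.r_hat(s, a), s_next
```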

Page 20

AB Example & Check Your Memory

Two states A,B; no discounting; 8 episodes of experience

We have constructed a table lookup model from the experience

Recall: For a particular policy, TD with a tabular representation with infinite experience replay will converge to the same value as computed if we construct an MLE model and do planning

Check Your Memory: Will MC methods converge to the same solution?

Page 21

AB Example & Check Your Memory Solution

Two states A,B; no discounting; 8 episodes of experience

We have constructed a table lookup model from the experience

Recall: For a particular policy, TD with a tabular representation with infinite experience replay will converge to the same value as computed if we construct an MLE model and do planning

Check Your Memory: Will MC methods converge to the same solution?

Solution: Not necessarily. MC methods will converge to the solution which minimizes the MSE between the value estimates and the observed returns

Page 22

Planning with a Model

Given a model M_η = <P_η, R_η>
Solve the MDP <S, A, P_η, R_η>
Using favourite planning algorithm:

Value iteration
Policy iteration
Tree search
...

Page 23

Sample-Based Planning

A simple but powerful approach to planning

Use the model only to generate samples

Sample experience from model

S_{t+1} \sim P_\eta(S_{t+1} | S_t, A_t)

R_{t+1} = R_\eta(R_{t+1} | S_t, A_t)

Apply model-free RL to samples, e.g.:

Monte-Carlo control
Sarsa
Q-learning
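
A rough sketch of sample-based planning along these lines (my own illustration): tabular Q-learning driven entirely by transitions sampled from a learned model such as the table-lookup model sketched earlier. `model.sample(s, a) -> (r, s')` is an assumed interface, and the hyperparameters are arbitrary.

```python
import random
from collections import defaultdict

def q_learning_on_model(model, actions, start_state,
                        n_updates=10_000, gamma=0.95, alpha=0.1, eps=0.1):
    """Tabular Q-learning where every transition comes from the learned model,
    not the real environment. Assumes every (s, a) reached in simulation has
    been visited in the data, so model.sample(s, a) is defined."""
    Q = defaultdict(float)
    s = start_state
    for _ in range(n_updates):
        # epsilon-greedy action from the current Q estimates
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a_: Q[(s, a_)])
        r, s_next = model.sample(s, a)   # simulated, not real, experience
        td_target = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
        Q[(s, a)] += alpha * (td_target - Q[(s, a)])
        s = s_next
    return Q
```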

Page 24

Planning with an Inaccurate Model

Given an imperfect model <P_η, R_η> ≠ <P, R>

Performance of model-based RL is limited to the optimal policy for the approximate MDP <S, A, P_η, R_η>, i.e. model-based RL is only as good as the estimated model

When the model is inaccurate, the planning process will compute a sub-optimal policy

Page 25

Back to the AB Example

Construct a table-lookup model from real experience

Apply model-free RL to sampled experience

Real experience:
A, 0, B, 0
B, 1
B, 1

What values will TD with the estimated model converge to, assuming γ = 1?

Is this correct?

Page 26

Back to the AB Example

Construct a table-lookup model from real experience

Apply model-free RL to sampled experience

Real experience:
A, 0, B, 0
B, 1
B, 1

What values will TD with the estimated model converge to, assuming γ = 1?

V(B) = 2/3 (B's observed rewards are 0, 1, 1), and V(A) = 0 + V(B) = 2/3, since under the estimated model A always transitions to B with reward 0

Is this correct? No
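
A tiny check of these numbers (my own sketch, using only the three episodes listed on this slide and γ = 1):

```python
# Episodes from the slide: (A, 0, B, 0), (B, 1), (B, 1).
episodes = [[("A", 0), ("B", 0)], [("B", 1)], [("B", 1)]]

# MLE reward for B (terminal): mean of its observed rewards.
b_rewards = [r for ep in episodes for (s, r) in ep if s == "B"]
V_B = sum(b_rewards) / len(b_rewards)   # 2/3

# Under the estimated model, A always transitions to B with reward 0.
V_A = 0 + V_B                           # 2/3

print(f"V(B) = {V_B:.3f}, V(A) = {V_A:.3f}")
```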

Page 27

Planning with an Inaccurate Model

Given an imperfect model <P_η, R_η> ≠ <P, R>

Performance of model-based RL is limited to the optimal policy for the approximate MDP <S, A, P_η, R_η>, i.e. model-based RL is only as good as the estimated model

When the model is inaccurate, the planning process will compute a sub-optimal policy

Solution 1: when the model is wrong, use model-free RL

Solution 2: reason explicitly about model uncertainty (see Lectures on Exploration/Exploitation)

Page 28

Table of Contents

1 Introduction

2 Model-Based Reinforcement Learning

3 Simulation-Based Search

Page 29

Computing Action for Current State Only

Previously we would compute a policy for the whole state space

Page 30

Simulation-Based Search

Simulate episodes of experience from now with the model, starting from the current state S_t:

\{S_t^k, A_t^k, R_{t+1}^k, \ldots, S_T^k\}_{k=1}^{K} \sim M_v

Apply model-free RL to simulated episodes:

Monte-Carlo control → Monte-Carlo search
Sarsa → TD search

Page 31

Simple Monte-Carlo Search

Given a model M_v and a simulation policy π

For each action a ∈ A, simulate K episodes from the current (real) state s_t:

\{s_t, a, R_{t+1}^k, \ldots, S_T^k\}_{k=1}^{K} \sim M_v, \pi

Evaluate actions by mean return (Monte-Carlo evaluation):

Q(s_t, a) = \frac{1}{K} \sum_{k=1}^{K} G_t \xrightarrow{P} q_\pi(s_t, a)   (1)

Select the current (real) action with maximum value:

a_t = \arg\max_{a \in A} Q(s_t, a)

This is essentially doing 1 step of policy improvement
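
A compact sketch of simple Monte-Carlo search as described above (my own illustration): `model.sample(s, a) -> (r, s')`, the rollout policy, and the terminal test are assumed interfaces rather than anything specified in the slides.

```python
def simple_mc_search(model, actions, s_t, rollout_policy, K=50,
                     gamma=1.0, max_depth=100, is_terminal=lambda s: False):
    """For each action, estimate Q(s_t, a) by averaging K simulated returns,
    then return the greedy action (one step of policy improvement)."""

    def rollout(s, a):
        g, discount = 0.0, 1.0
        for _ in range(max_depth):
            r, s = model.sample(s, a)    # simulate one step with the model
            g += discount * r
            discount *= gamma
            if is_terminal(s):
                break
            a = rollout_policy(s)        # e.g. uniformly random actions
        return g

    q = {a: sum(rollout(s_t, a) for _ in range(K)) / K for a in actions}
    return max(q, key=q.get)             # a_t = argmax_a Q(s_t, a)
```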

Page 32

Simulation-Based Search

Simulate episodes of experience from now with the model

Apply model-free RL to simulated episodes

Page 33

Expectimax Tree

Can we do better than 1 step of policy improvement?

If we have an MDP model M_v

Can compute optimal q(s, a) values for the current state by constructing an expectimax tree

Page 34

Forward Search Expectimax Tree

Forward search algorithms select the best action by lookahead

They build a search tree with the current state s_t at the root

Using a model of the MDP to look ahead

No need to solve whole MDP, just sub-MDP starting from now

Page 35

Expectimax Tree

Can we do better than 1 step of policy improvement?

If we have an MDP model M_v

Can compute optimal q(s, a) values for the current state by constructing an expectimax tree

Limitations: Size of the tree scales as... (try to work it out yourself)

Page 36

Expectimax Tree

Can we do better than 1 step of policy improvement?

If we have an MDP model M_v

Can compute optimal q(s, a) values for the current state by constructing an expectimax tree

Limitations: Size of the tree scales as (|S| |A|)^H
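
To see where the (|S||A|)^H scaling comes from, here is a naive expectimax recursion (my own sketch; `model.transitions(s, a)` returning [(prob, reward, s'), ...] and `model.actions` are assumed interfaces): every level alternates a max over all actions with an expectation over all successor states, so the number of recursive calls grows as (|S||A|)^H.

```python
def expectimax_q(model, s, a, depth, gamma=1.0):
    """q(s, a) by exhaustive expectimax lookahead to a fixed horizon `depth`."""
    total = 0.0
    for p, r, s_next in model.transitions(s, a):   # [(prob, reward, s'), ...]
        future = 0.0
        if depth > 1:
            # max over actions at the next level, then expectation again below it
            future = max(expectimax_q(model, s_next, a2, depth - 1, gamma)
                         for a2 in model.actions)
        total += p * (r + gamma * future)
    return total

# q(s_t, a) for horizon H; the recursion touches O((|S||A|)^H) nodes:
# best_a = max(model.actions, key=lambda a: expectimax_q(model, s_t, a, H))
```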

Page 37

Monte-Carlo Tree Search (MCTS)

Given a model M_v

Build a search tree rooted at the current state s_t

Samples actions and next states

Iteratively construct and update the tree by performing K simulation episodes starting from the root state

After search is finished, select the current (real) action with maximum value in the search tree:

a_t = \arg\max_{a \in A} Q(s_t, a)

Page 38

Monte-Carlo Tree Search

Goal: Compute optimal action for current state

Simulating an episode involves two phases (in-tree, out-of-tree)

Tree policy: pick actions for tree nodes to maximize Q(S, A)
Rollout policy: e.g. pick actions randomly, or use another policy

Page 39

Monte-Carlo Tree Search

Goal: Compute optimal action for current state

Simulating an episode involves two phases (in-tree, out-of-tree)

Tree policy: pick actions for tree nodes to maximize Q(S, A)
Rollout policy: e.g. pick actions randomly, or use another policy

To evaluate the value of a tree node i at state-action pair (s, a), average over all rewards received from that node onwards across simulated episodes in which this tree node was reached:

Q(i) = \frac{1}{N(i)} \sum_{k=1}^{K} \sum_{u=t}^{T} 1(i \in epi.k) \, G_k(i) \xrightarrow{P} q(s, a)

Under mild conditions, converges to the optimal search tree, Q(S, A) → q_*(S, A)

Page 40

Check Your Understanding: MCTS

MCTS involves deciding on an action to take by doing tree search, where it picks actions to maximize Q(S, A) and samples states. Select all that are true:

1. Given an MDP, MCTS may be a good choice for short horizon problems with a small number of states and actions.

2. Given an MDP, MCTS may be a good choice for long horizon problems with a large action space and a small state space

3. Given an MDP, MCTS may be a good choice for long horizon problems with a large state space and a small action space

4. Not sure

Page 41

Check Your Understanding: MCTS

MCTS involves deciding on an action to take by doing tree search, where it picks actions to maximize Q(S, A) and samples states. Select all that are true:

1. Given an MDP, MCTS may be a good choice for short horizon problems with a small number of states and actions.

2. Given an MDP, MCTS may be a good choice for long horizon problems with a large action space and a small state space

3. Given an MDP, MCTS may be a good choice for long horizon problems with a large state space and a small action space

4. Not sure

Solutions: F, F, T

Page 42

Upper Confidence Tree (UCT) Search

How to select what action to take during a simulated episode?

Page 43

Upper Confidence Tree (UCT) Search

How to select what action to take during a simulated episode?

UCT: borrow an idea from the bandit literature and treat each node where we can select actions as a multi-armed bandit (MAB) problem

Maintain an upper confidence bound over reward of each arm

Page 44

Upper Confidence Tree (UCT) Search

How to select what action to take during a simulated episode?

UCT: borrow an idea from the bandit literature and treat each node where we can select actions as a multi-armed bandit (MAB) problem

Maintain an upper confidence bound over the reward of each arm:

Q(s, a, i) = \frac{1}{N(s, a, i)} \sum_{k=1}^{K} \sum_{u=t}^{T} 1(i \in epi.k) \, G_k(s, a, i) + c \sqrt{\frac{\ln n(s)}{n(s, a)}}

For simplicity, we can treat each state node as a separate MAB

For simulated episode k at node i, select the action/arm with the highest upper bound to simulate and expand (or evaluate) in the tree:

a_{ik} = \arg\max_a Q(s, a, i)

This implies that the policy used to simulate episodes (and expand/update the tree) can change across episodes
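
Putting the pieces together, a rough UCT-style MCTS sketch (my own illustration, not AlphaGo's implementation): UCB-based action selection inside the tree, random rollouts outside it, and a backup of the return along the visited path. `model.sample(s, a) -> (r, s')`, `model.actions`, and `is_terminal` are assumed interfaces, and the constant c is arbitrary.

```python
import math
import random
from collections import defaultdict

def uct_search(model, root, n_episodes=1000, gamma=1.0, c=1.4,
               max_depth=50, is_terminal=lambda s: False):
    N_s = defaultdict(int)     # n(s): visits to each tree node
    N_sa = defaultdict(int)    # n(s, a)
    Q = defaultdict(float)     # running mean return for (s, a)
    tree = {root}              # states that have been expanded into the tree

    def ucb_action(s):
        # Try each action once before applying the upper confidence bound.
        untried = [a for a in model.actions if N_sa[(s, a)] == 0]
        if untried:
            return random.choice(untried)
        return max(model.actions,
                   key=lambda a: Q[(s, a)]
                   + c * math.sqrt(math.log(N_s[s]) / N_sa[(s, a)]))

    for _ in range(n_episodes):
        s, path, rewards = root, [], []
        # In-tree phase: follow the UCB tree policy until leaving the tree.
        while s in tree and not is_terminal(s) and len(path) < max_depth:
            a = ucb_action(s)
            r, s_next = model.sample(s, a)
            path.append((s, a))
            rewards.append(r)
            s = s_next
        # Expand one new node, then roll out with a random policy.
        tree.add(s)
        g, discount, depth = 0.0, 1.0, len(path)
        while not is_terminal(s) and depth < max_depth:
            a = random.choice(model.actions)
            r, s = model.sample(s, a)
            g += discount * r
            discount *= gamma
            depth += 1
        # Backup: accumulate the discounted return along the visited path.
        for (s_i, a_i), r_i in reversed(list(zip(path, rewards))):
            g = r_i + gamma * g
            N_s[s_i] += 1
            N_sa[(s_i, a_i)] += 1
            Q[(s_i, a_i)] += (g - Q[(s_i, a_i)]) / N_sa[(s_i, a_i)]

    # Act greedily at the real state once the search budget is spent.
    return max(model.actions, key=lambda a: Q[(root, a)])
```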

Page 45

Case Study: the Game of Go

Go is 2500 years old

Hardest classic board game

Grand challenge task (John McCarthy)

Traditional game-tree search has failed in Go

Check your understanding: does playing Go involve learning to make decisions in a world where the dynamics and reward model are unknown?

Page 46

Rules of Go

Usually played on 19x19, also 13x13 or 9x9 board

Simple rules, complex strategy

Black and white place down stones alternately

Surrounded stones are captured and removed

The player with more territory wins the game

Page 47

Position Evaluation in Go

How good is a position s?

Reward function (undiscounted):

R_t = 0 for all non-terminal steps t < T

R_T = 1 if Black wins, 0 if White wins

Policy π = <π_B, π_W> selects moves for both players

Value function (how good is position s):

v_\pi(s) = E_\pi[R_T | S = s] = P[\text{Black wins} | S = s]

v_*(s) = \max_{\pi_B} \min_{\pi_W} v_\pi(s)

Page 48

Monte-Carlo Evaluation in Go

Page 49

Applying Monte-Carlo Tree Search (1)

Go is a 2-player game, so the tree is a minimax tree instead of an expectimax tree

White minimizes future reward and Black maximizes future reward when computing which action to simulate

Page 50

Applying Monte-Carlo Tree Search (2)

Page 51

Applying Monte-Carlo Tree Search (3)

Page 52

Applying Monte-Carlo Tree Search (4)

Page 53

Applying Monte-Carlo Tree Search (5)

Page 54

Advantages of MC Tree Search

Highly selective best-first search

Evaluates states dynamically (unlike e.g. DP)

Uses sampling to break curse of dimensionality

Works for “black-box” models (only requires samples)

Computationally efficient, anytime, parallelisable

Page 55

In more depth: Upper Confidence Tree (UCT) Search

UCT: borrow an idea from the bandit literature and treat each tree node where we can select actions as a multi-armed bandit (MAB) problem

Maintain an upper confidence bound over the reward of each arm and select the best arm

Check your understanding: Why is this slightly strange? Hint: why were upper confidence bounds a good idea for exploration/exploitation? Is there an exploration/exploitation problem during simulated episodes? [1]

[1] Relates to metalevel reasoning (for an example related to Go, see "Selecting Computations: Theory and Applications", Hay, Russell, Tolpin and Shimony, 2012)

Page 56

Check Your Understanding: UCT Search

In Upper Confidence Tree (UCT) search we treat each tree node as a multi-armed bandit (MAB) problem, and use an upper confidence bound over the future value of each action to help select actions for later rollouts. Select all that are true:

1. This may be useful since it will prioritize actions that lead to later good rewards

2. UCB minimizes regret. UCT is minimizing regret within rollouts of the tree. (If this is true, think about whether this is a good idea.)

3. Not sure

Page 57

Check Your Understanding: UCT Search Solutions

In Upper Confidence Tree (UCT) search we treat each tree node as a multi-armed bandit (MAB) problem, and use an upper confidence bound over the future value of each action to help select actions for later rollouts. Select all that are true:

1. This may be useful since it will prioritize actions that lead to later good rewards

2. UCB minimizes regret. UCT is minimizing regret within rollouts of the tree. (If this is true, think about whether this is a good idea.)

3. Not sure

Answers: T, T (but not a good idea)

Page 58

In more depth: Upper Confidence Tree (UCT) Search

UCT: borrow an idea from the bandit literature and treat each tree node where we can select actions as a multi-armed bandit (MAB) problem

Maintain an upper confidence bound over the reward of each arm and select the best arm

Hint: why were upper confidence bounds a good idea for exploration/exploitation? Is there an exploration/exploitation problem during simulated episodes? [2]

[2] Relates to metalevel reasoning (for an example related to Go, see "Selecting Computations: Theory and Applications", Hay, Russell, Tolpin and Shimony, 2012)

Page 59

AlphaGo

AlphaGo trailer link

MCTS is part of what led to success in Go

A number of key additional choices:

There are an enormous number of possible moves. Prioritizing which actions to take when branching matters. In AlphaGo this was initially done by training a supervised learning NN to mimic moves made by humans.

Instead of rolling all the way out to the leaves (game won or lost), one can bootstrap and substitute an estimate of the future value. They did this in AlphaGo and this also made a big difference.

For training this value estimate they used self-play. This was extremely useful. Self-play in games means that an agent will frequently win and lose because the other agent it is playing is of a similar difficulty level. This is very helpful for providing enough reward signal.

Page 60

Beyond AlphaGo

There have been several more impressive insights that don't rely on human knowledge:

Mastering the game of Go without human knowledge. Silver et al. Nature 2017.

A general RL algorithm that masters chess, shogi, and Go through self-play. Silver et al. Science 2018.

Page 61

What You Should Know

MCTS computes a decision only for the current state

It is an alternative to value iteration or policy iteration and can beused in extremely large state spaces

UCT uses optimism under uncertainty to choose to further expandaction nodes that seem promising

MCTS has been used with RL to create an AI Go player that exceedsthe best human performance

Page 62

Class Structure

Last time: Quiz

This Time: MCTS

Next time: Last lecture
