
CS 598 Statistical Reinforcement Learning

Nan Jiang

Overview


What’s this course about?

• A grad-level seminar course on the theory of RL
• with a focus on sample complexity analyses
• all about proofs, some perspectives, zero implementation
• A seminar course can be anywhere between students presenting papers and a rigorous course
• In this course I will deliver most (or all) of the lectures
• No textbook; the material is created by myself (course notes)
• I will share slides/notes on the course website


Who should take this course?

• This course will be a good fit for you if you…
• are interested in understanding RL mathematically
• want to do research in the theory of RL or a related area
• want to work with me
• are comfortable with maths

• This course will not be a good fit if you…
• are mostly interested in implementing RL algorithms and making things work; we won't cover any engineering tricks (which are essential to, e.g., deep RL). Check other RL courses offered across the campus.


Prerequisites

• Maths
• Linear algebra, probability & statistics, basic calculus
• Markov chains
• Optional: stochastic processes, numerical analysis
• Useful for research: TCS background, empirical processes and statistical learning theory, optimization, online learning
• Exposure to ML
• e.g., CS 446 Machine Learning
• Experience with RL


Coursework

• Some readings before/after class
• Ad hoc homework to help digest particular material. Deadlines will be lenient and announced at the time of assignment
• Main assignment: course project (work on your own)
• Baseline: reproduce the theoretical analysis in existing papers
• Advanced: identify an interesting/challenging extension to the paper and explore the novel research question yourself
• Or, just work on a novel research question (it must have a significant theoretical component; you need to discuss it with me)


Course project (cont.)

• See the list of references and potential topics on the website
• You will need to submit:
• A brief proposal (~1/2 page). Tentative deadline: end of February
• what the topic is and which papers you plan to work on
• why you chose the topic: what interests you?
• which aspect(s) you will focus on
• Final report: clarity, precision, and brevity are greatly valued. More details to come…
• All documents should be in PDF. The final report should be prepared using LaTeX.


Course project (cont. 2)

Rule of thumb:
1. learn something that interests you
2. teach me something! (I wouldn't learn if I could not understand your report due to lack of clarity)
3. write a report similar to (or better than!) the notes I will share with you


Contents of the course

• Many important topics in RL will not be covered in depth (e.g., TD); read more if you want to get a more comprehensive view of RL
• The other opportunity to learn what's not covered in lectures is the project, as the potential topics for projects are much broader than what's covered in class.


My goals

• Encourage you to do research in this area
• Introduce useful mathematical tools to you
• Learn from you


Logistics

• Course website: http://nanjiang.cs.illinois.edu/cs598/
• logistics, links to slides/notes (uploaded after lectures), resources (e.g., textbooks to consult, related courses), deadlines & announcements
• Time & location: Tue & Thu 12:30-1:45pm, 1304 Siebel.
• TA: Jinglin Chen (jinglinc)
• Office hours: by appointment, 3322 Siebel.
• Questions about the material: ad hoc meetings after lectures, subject to my availability
• Other things about the class (e.g., the project): by appointment


Introduction to MDPs and RL


Reinforcement Learning (RL): Applications

[Figure: a collage of application examples, with citations: Levine et al. '16, Ng et al. '03, Singh et al. '02, Lei et al. '12, Mandel et al. '16, Tesauro et al. '07, Mnih et al. '15, Silver et al. '16.]

Shortest Path

[Figure: a directed graph with nodes s0, b, c, d, e, f, g and edge costs; the nodes are labeled "state" and the edges "action".]

• Greedy is suboptimal due to delayed effects; need long-term planning
• Bellman equation: V*(d) = min{ 3 + V*(g), 2 + V*(f) }
• V*(g) = 0, V*(f) = 1, V*(d) = 2, V*(e) = 2, V*(c) = 4, V*(b) = 5, V*(s0) = 6
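For concreteness, here is a minimal sketch of that backward Bellman computation in Python. The graph below is made up for illustration; the exact edge costs of the slide's figure are not recoverable from this transcript, so the resulting numbers are not meant to match the values above.

```python
# A made-up directed graph: graph[s] maps each successor s2 to the edge cost.
graph = {
    "s0": {"b": 1, "c": 2},
    "b":  {"d": 4},
    "c":  {"d": 4, "e": 2},
    "d":  {"f": 2, "g": 3},
    "e":  {"f": 1},
    "f":  {"g": 1},
    "g":  {},                      # goal state
}

def optimal_values(graph, goal="g"):
    """Bellman equation V*(s) = min_{s2} [cost(s, s2) + V*(s2)], with V*(goal) = 0."""
    V = {s: float("inf") for s in graph}
    V[goal] = 0.0
    for _ in range(len(graph)):    # enough sweeps for an acyclic graph of this size
        for s, edges in graph.items():
            if edges:              # skip the goal (no outgoing edges)
                V[s] = min(cost + V[s2] for s2, cost in edges.items())
    return V

print(optimal_values(graph))       # for this toy graph: V*(g)=0, V*(f)=1, V*(e)=2, ...
```

Because this toy graph is acyclic, a fixed number of sweeps suffices; on graphs with cycles one would instead iterate until the values stop changing.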

Stochastic Shortest Path

[Figure: a variant of the graph (one edge now has cost 4), where some actions have stochastic outcomes with transition distributions such as (0.7, 0.3) and (0.5, 0.5) over the possible next states; the annotations read "state", "action", and "transition distribution".]

• This is a Markov Decision Process (MDP).

Stochastic Shortest Path

[Figure: the stochastic graph again; checkmarks mark an action at each state and one state is marked "?".]

• Greedy is suboptimal due to delayed effects; need long-term planning
• Bellman equation: V*(c) = min{ 4 + 0.7 × V*(d) + 0.3 × V*(e), 2 + V*(e) }
• V*(g) = 0, V*(f) = 1, V*(d) = 2, V*(e) = 2.5, V*(c) = 4.5, V*(b) = 6, V*(s0) = 6.5
• ✓: optimal policy π*

Stochastic Shortest Path, via trial-and-error

[Figure: the same stochastic graph; the transition probabilities are not known in advance, so the agent collects trajectories by interacting with the environment.]

• Trajectory 1: s0 ↘ c ↗ d → g
• Trajectory 2: s0 ↘ c ↗ e → f ↗ g

Stochastic Shortest Path, via trial-and-error (Model-based RL)

[Figure: the graph with the transition probabilities replaced by empirical estimates from the collected trajectories, e.g., (0.72, 0.28) and (0.55, 0.45) in place of the true (0.7, 0.3) and (0.5, 0.5).]

• Model-based RL: estimate the transition distributions from the trajectories, then plan with the estimated model
• How many trajectories do we need to compute a near-optimal policy? (sample complexity)

Stochastic Shortest Path, via trial-and-error

[Figure: the graph with the unknown transition probabilities marked "?".]

How many trajectories do we need to compute a near-optimal policy?

• Assume states & actions are visited uniformly
• #trajectories needed ≤ n ⋅ (#state-action pairs), where n is the #samples needed to estimate a multinomial distribution
• Nontrivial! Need exploration
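The model-based step above boils down to estimating one multinomial distribution per state-action pair from counts. A minimal sketch, with made-up trajectory data and state/action names (the slides do not specify a data format):

```python
from collections import defaultdict

# Estimate each transition distribution P(. | s, a) by empirical counts over
# the collected trajectories; each trajectory is a list of (s, a, s') triples.
def estimate_transitions(trajectories):
    counts = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:
        for s, a, s_next in traj:
            counts[(s, a)][s_next] += 1
    P_hat = {}
    for (s, a), next_counts in counts.items():
        total = sum(next_counts.values())
        P_hat[(s, a)] = {s2: c / total for s2, c in next_counts.items()}
    return P_hat

# Made-up data: with more visits to (c, "up"), the empirical distribution over
# next states would approach something like the (0.72, 0.28) on the slide.
trajs = [[("c", "up", "d")], [("c", "up", "e")], [("c", "up", "d")], [("c", "up", "d")]]
print(estimate_transitions(trajs))   # {('c', 'up'): {'d': 0.75, 'e': 0.25}}
```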

Video game playing

[Figure: a game screen annotated with the MDP ingredients below; small plots compare discount factors γ = 0.8 vs. γ = 0.9 and horizons H = 5 vs. H = 10.]

• state s_t ∈ S
• action a_t ∈ A
• reward r_t = R(s_t, a_t), e.g., +20
• transition dynamics P(⋅ | s_t, a_t), which are unknown; e.g., the random spawn of enemies
• policy π: S → A
• objective: maximize E[ ∑_{t=1}^H r_t | π ] (or E[ ∑_{t=1}^∞ γ^{t-1} r_t | π ])

Video game playing

• Need generalization

Value function approximation

[Figure: state features x are fed into a parametric function f(x; θ) with parameters θ; a plot shows the approximated value against the state features x, and transitions x → x' with reward r serve as training data.]

• Find θ s.t. f(x; θ) ≈ r + γ ⋅ E_{x'|x}[ f(x'; θ) ]  ⇒  f(⋅; θ) ≈ V*
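A minimal sketch of that idea with a linear approximator f(x; θ) = θᵀx, fit by repeatedly regressing on bootstrapped targets r + γ·f(x'; θ). The slide only states the fixed-point condition; this particular iterative least-squares procedure and the synthetic transition data below are assumptions made for illustration.

```python
import numpy as np

# Fit theta so that f(x; theta) = x @ theta approximately satisfies
#     f(x; theta) ≈ r + gamma * f(x'; theta)
# on logged transitions (x, r, x'). The data here is synthetic.
rng = np.random.default_rng(0)
gamma, d, n = 0.9, 4, 1000
X = rng.normal(size=(n, d))                          # state features x
X_next = 0.8 * X + 0.1 * rng.normal(size=(n, d))     # features of the next state x'
r = X @ np.array([1.0, -0.5, 0.2, 0.0]) + 0.01 * rng.normal(size=n)

theta = np.zeros(d)
for _ in range(200):
    targets = r + gamma * X_next @ theta             # bootstrapped regression targets
    theta, *_ = np.linalg.lstsq(X, targets, rcond=None)

print("fitted theta:", theta)                        # f(x; theta) approximates the value
```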

Adaptive medical treatment

• State: diagnosis
• Action: treatment
• Reward: progress in recovery

A Machine Learning view of RL


[Figure: "Many Faces of Reinforcement Learning", a diagram relating Computer Science (Machine Learning), Engineering (Optimal Control), Mathematics (Operations Research), Economics (Bounded Rationality), Psychology (Classical/Operant Conditioning), and Neuroscience (Reward System), with Reinforcement Learning at their intersection. Slide credit: David Silver, "Lecture 1: Introduction to Reinforcement Learning", About RL.]


Supervised Learning

Given {(x(i), y(i))}, learn f: x ↦ y

• Online version: for round t = 1, 2, …, the learner
• observes x(t)
• predicts ŷ(t)
• receives y(t)
• Want to maximize the # of correct predictions
• e.g., classify whether an image is of a dog, a cat, a plane, etc. (multi-class classification)
• Dataset is fixed for everyone
• "Full information setting"
• Core challenge: generalization


Contextual bandits

For round t = 1, 2, …, the learner
• Given x(t), chooses an action a(t) from a set of actions A
• Receives reward r(t) ~ R(x(t), a(t)) (i.e., the reward can be random)
• Want to maximize total reward
• You generate your own dataset {(x(t), a(t), r(t))}!
• e.g., for an image, the learner guesses a label and is told whether the guess is correct or not (reward = 1 if correct and 0 otherwise); it never learns the true label.
• e.g., for a user, the website recommends a movie and observes whether the user likes it or not; it never learns which movies the user really wants to see.
• "Partial information setting"
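A minimal simulation of this protocol using the slide's label-guessing example; the uniformly random "learner" below is only a placeholder policy, not an algorithm from the lecture.

```python
import random

# The learner sees a context, picks a label (the "action"), and only observes a
# 0/1 reward; the true label stays hidden. All details here are made up.
labels = ["dog", "cat", "plane"]
random.seed(0)

dataset, total_reward = [], 0
for t in range(1000):
    x_t = random.random()                          # stand-in for image features
    true_label = labels[int(x_t * len(labels))]    # hidden from the learner
    a_t = random.choice(labels)                    # learner's guess
    r_t = 1 if a_t == true_label else 0            # bandit feedback only
    dataset.append((x_t, a_t, r_t))                # the dataset the learner generates
    total_reward += r_t

print("average reward:", total_reward / 1000)      # about 1/3 for random guessing
```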


Contextual Bandits (cont.)

• Simplification: no x → Multi-Armed Bandits (MAB)
• Bandits are a research area by themselves. We will not do a lot of bandits, but may go through some material that has important implications for general RL (e.g., lower bounds)


RL

For round t = 1, 2, …,
• For time step h = 1, 2, …, H, the learner
• Observes x_h(t)
• Chooses a_h(t)
• Receives r_h(t) ~ R(x_h(t), a_h(t))
• Next x_{h+1}(t) is generated as a function of x_h(t) and a_h(t) (or sometimes, all previous x's and a's within round t)
• Bandits + "delayed rewards/consequences"
• The protocol here is for episodic RL (each t is an episode).
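A minimal, runnable sketch of this episodic protocol. The toy environment and the random learner are hypothetical placeholders; the slides specify only the interaction protocol, not these components.

```python
import random

class ToyEnv:
    """A made-up environment with integer states 0..4 and binary actions."""
    def reset(self):
        self.x = 0
        return self.x
    def step(self, a):                         # reward and next state depend on (x, a)
        r = 1.0 if a == self.x % 2 else 0.0
        self.x = random.randint(0, 4)
        return r, self.x

class RandomLearner:
    def choose(self, x):
        return random.randint(0, 1)
    def update(self, x, a, r, x_next):
        pass                                   # a real learner would improve here

random.seed(0)
env, learner = ToyEnv(), RandomLearner()
total = 0.0
for t in range(100):                           # each round t is one episode
    x = env.reset()                            # observe x_1
    for h in range(10):                        # time steps h = 1, ..., H
        a = learner.choose(x)                  # choose a_h
        r, x_next = env.step(a)                # receive r_h ~ R(x_h, a_h), see x_{h+1}
        learner.update(x, a, r, x_next)
        x = x_next
        total += r
print("average per-episode return:", total / 100)
```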


Why statistical RL?

Two types of scenarios in RL research:

1. Solving a large planning problem using a learning approach
- e.g., AlphaGo, video game playing, simulated robotics
- Transition dynamics (the rules of Go) known, but too many states
- Run the simulator to collect data

2. Solving a learning problem
- e.g., adaptive medical treatment
- Transition dynamics unknown (and too many states)
- Interact with the environment to collect data


Why statistical RL?

Two types of scenarios in RL research:
1. Solving a large planning problem using a learning approach
2. Solving a learning problem

• I am more interested in #2; it is more challenging in some aspects.
• Data (real-world interactions) is the highest priority; computation comes second.
• Even for #1, sample complexity lower-bounds computational complexity, so a statistical-first approach is reasonable.
• Caveat to this argument: you can do a lot more in a simulator; see http://hunch.net/?p=8825714

MDP Planning


Infinite-horizon discounted MDPs

An MDP M = (S, A, P, R, γ):

• State space S.
• Action space A.
• Transition function P: S×A → Δ(S), where Δ(S) is the probability simplex over S, i.e., all non-negative vectors of length |S| that sum up to 1.
• Reward function R: S×A → ℝ (a deterministic reward function).
• Discount factor γ ∈ [0, 1).
• The agent starts in some state s1, takes action a1, receives reward r1 ~ R(s1, a1), transitions to s2 ~ P(s1, a1), takes action a2, and so on; the process continues indefinitely.

We will only consider discrete and finite spaces in this course.
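A minimal tabular encoding of such an MDP and a few steps of the interaction just described; the 2-state, 2-action numbers are made up for illustration and are not an MDP from the slides.

```python
import numpy as np

# M = (S, A, P, R, gamma): P[s, a] is a distribution over next states, R[s, a] a reward.
num_S, num_A, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],       # P[s, a, s'] sums to 1 over s'
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

rng = np.random.default_rng(0)
s = 0                                          # start state s_1
for t in range(5):                             # the real process continues indefinitely
    a = rng.integers(num_A)                    # here: a uniformly random action
    r = R[s, a]                                # deterministic reward R(s, a)
    s = rng.choice(num_S, p=P[s, a])           # next state ~ P(. | s, a)
    print(f"t={t + 1}: a={a}, r={r}, next s={s}")
```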


Value and policy

• Want to take actions in a way that maximizes value (or return): E[ ∑_{t=1}^∞ γ^{t-1} r_t ]
• This value depends on where you start and how you act
• Often assume boundedness of rewards: r_t ∈ [0, R_max]
• What's the range of E[ ∑_{t=1}^∞ γ^{t-1} r_t ]? It lies in [0, R_max / (1 − γ)]
• A (deterministic) policy π: S → A describes how the agent acts: at state s_t, it chooses action a_t = π(s_t).
• More generally, the agent may choose actions randomly (π: S → Δ(A)), or even in a way that varies across time steps ("non-stationary policies")
• Define V^π(s) = E[ ∑_{t=1}^∞ γ^{t-1} r_t | s_1 = s, π ]

Bellman equation for policy evaluation

V^π(s) = E[ ∑_{t=1}^∞ γ^{t-1} r_t | s_1 = s, π ]
       = E[ r_1 + ∑_{t=2}^∞ γ^{t-1} r_t | s_1 = s, π ]
       = R(s, π(s)) + ∑_{s'∈S} P(s'|s, π(s)) E[ γ ∑_{t=2}^∞ γ^{t-2} r_t | s_1 = s, s_2 = s', π ]
       = R(s, π(s)) + ∑_{s'∈S} P(s'|s, π(s)) E[ γ ∑_{t=2}^∞ γ^{t-2} r_t | s_2 = s', π ]   (Markov property)
       = R(s, π(s)) + γ ∑_{s'∈S} P(s'|s, π(s)) E[ ∑_{t=1}^∞ γ^{t-1} r_t | s_1 = s', π ]
       = R(s, π(s)) + γ ∑_{s'∈S} P(s'|s, π(s)) V^π(s')
       = R(s, π(s)) + γ ⟨ P(⋅|s, π(s)), V^π(⋅) ⟩


Bellman equation for policy evaluation

Matrix form: define
• V^π as the |S|×1 vector [V^π(s)]_{s∈S}
• R^π as the vector [R(s, π(s))]_{s∈S}
• P^π as the matrix [P(s'|s, π(s))]_{s∈S, s'∈S}

Then V^π(s) = R(s, π(s)) + γ ⟨ P(⋅|s, π(s)), V^π(⋅) ⟩ becomes

V^π = R^π + γ P^π V^π
(I − γP^π) V^π = R^π
V^π = (I − γP^π)^{-1} R^π

(I − γP^π) is always invertible. Proof?
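A minimal numerical sketch of this matrix form: solve (I − γP^π)V^π = R^π directly. The 3-state P^π and R^π below are made up for illustration.

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.1, 0.9, 0.0],          # row s = P(. | s, pi(s))
                 [0.0, 0.2, 0.8],
                 [0.5, 0.0, 0.5]])
R_pi = np.array([0.0, 1.0, 2.0])           # R(s, pi(s))

V_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)
print(V_pi)

# Sanity check of the Bellman equation V_pi = R_pi + gamma * P_pi V_pi:
assert np.allclose(V_pi, R_pi + gamma * P_pi @ V_pi)
```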


State occupancy

Each row (indexed by s) of (I − γP^π)^{-1} is the discounted state occupancy d_s^π, whose (s')-th entry is

d_s^π(s') = E[ ∑_{t=1}^∞ γ^{t-1} 𝕀[s_t = s'] | s_1 = s, π ]

• Each row is like a distribution vector, except that the entries sum up to 1/(1−γ). Let η_s^π = (1−γ) d_s^π denote the normalized vector.
• V^π(s) is the dot product between d_s^π and the reward vector.
• d_s^π(s') can also be interpreted as a value function under the indicator reward function 𝕀[⋅ = s'].
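A small numerical check of these occupancy facts, reusing the same kind of made-up P^π and R^π as in the previous sketch.

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.1, 0.9, 0.0],
                 [0.0, 0.2, 0.8],
                 [0.5, 0.0, 0.5]])
R_pi = np.array([0.0, 1.0, 2.0])

D = np.linalg.inv(np.eye(3) - gamma * P_pi)        # row s = discounted occupancy d_s^pi
assert np.allclose(D.sum(axis=1), 1.0 / (1.0 - gamma))   # entries sum to 1/(1 - gamma)
assert np.allclose(D @ R_pi,                              # V^pi(s) = <d_s^pi, R^pi>
                   np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi))
print("occupancy checks passed")
```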


Optimality

• For infinite-horizon discounted MDPs, there always exists a stationary and deterministic policy that is optimal for all starting states simultaneously
• Proof: Puterman '94, Thm 6.2.7 (reference due to Shipra Agrawal)
• Let π* denote this optimal policy, and V* := V^{π*}
• Bellman Optimality Equation: V*(s) = max_{a∈A} ( R(s, a) + γ E_{s'∼P(s,a)}[ V*(s') ] )
• If we know V*, how do we get π*?
• Easier to work with Q-values Q*(s, a), since π*(s) = argmax_{a∈A} Q*(s, a), where

Q*(s, a) = R(s, a) + γ E_{s'∼P(s,a)}[ max_{a'∈A} Q*(s', a') ]
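One standard way to compute Q* from the Bellman optimality equation is value iteration: repeatedly apply the optimality backup until the iterates stop changing. The slide states the equation rather than this algorithm, and the small MDP below is made up for illustration.

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],       # P[s, a, s']
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                    # R[s, a]

Q = np.zeros_like(R)
for _ in range(1000):
    V = Q.max(axis=1)                         # V*(s') = max_a' Q*(s', a')
    Q_new = R + gamma * P @ V                 # Bellman optimality backup
    if np.max(np.abs(Q_new - Q)) < 1e-10:
        break
    Q = Q_new

pi_star = Q.argmax(axis=1)                    # greedy policy: pi*(s) = argmax_a Q*(s, a)
print("Q*:", Q, "pi*:", pi_star, sep="\n")
```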


Ad Hoc Homework 1

• uploaded on the course website
• helps you understand the relationships between alternative MDP formulations
• more like readings with questions to think about
• no need to submit
• we will go through it in class next week