CS 598 Statistical Reinforcement Learning

Nan Jiang


Overview

What's this course about?

• A grad-level seminar course on the theory of RL
  • with a focus on sample complexity analyses
  • all about proofs, some perspectives, zero implementation
• A seminar course can be anywhere between students presenting papers and a rigorous course
  • In this course I will deliver most (or all) of the lectures
• No textbook; the material is created by myself (course notes)
  • I will share slides/notes on the course website

Who should take this course?

• This course will be a good fit for you if you…
  • are interested in understanding RL mathematically
  • want to do research in the theory of RL or a related area
  • want to work with me
  • are comfortable with math

• This course will not be a good fit if you…
  • are mostly interested in implementing RL algorithms and making things work: we won't cover any engineering tricks (which are essential to, e.g., deep RL); check the other RL courses offered across campus

Prerequisites

• Math
  • Linear algebra, probability & statistics, basic calculus
  • Markov chains
  • Optional: stochastic processes, numerical analysis
  • Useful for research: TCS background, empirical processes and statistical learning theory, optimization, online learning
• Exposure to ML
  • e.g., CS 446 Machine Learning
• Experience with RL

Coursework

• Some readings before/after class
• Ad hoc homework to help you digest particular material. Deadlines will be lenient and announced at the time of assignment
• Main assignment: course project (work on your own)
  • Baseline: reproduce the theoretical analysis in existing papers
  • Advanced: identify an interesting/challenging extension to the paper and explore the novel research question yourself
  • Or, just work on a novel research question (it must have a significant theoretical component; discuss it with me first)

Course project (cont.)

• See the list of references and potential topics on the website
• You will need to submit:
  • A brief proposal (~1/2 page). Tentative deadline: end of February
    • what the topic is and which papers you plan to work on
    • why you chose the topic: what interests you?
    • which aspect(s) you will focus on
  • A final report: clarity, precision, and brevity are greatly valued. More details to come…
• All documents should be in PDF. The final report should be prepared using LaTeX.

Course project (cont. 2)

Rule of thumb:
1. learn something that interests you
2. teach me something! (I wouldn't learn anything if I could not understand your report due to lack of clarity)
3. write a report similar to (or better than!) the notes I will share with you

Contents of the course

• Many important topics in RL will not be covered in depth (e.g., TD). Read more if you want a more comprehensive view of RL.
• The other opportunity to learn what's not covered in lectures is the project, since the potential project topics are much broader than what's covered in class.

My goals

• Encourage you to do research in this area
• Introduce useful mathematical tools to you
• Learn from you

Logistics

• Course website: http://nanjiang.cs.illinois.edu/cs598/
  • logistics, links to slides/notes (uploaded after lectures), resources (e.g., textbooks to consult, related courses), and deadlines & announcements
• Time & location: Tue & Thu 12:30-1:45pm, 1304 Siebel
• TA: Jinglin Chen (jinglinc)
• Office hours: by appointment, 3322 Siebel
  • Questions about the material: ad hoc meetings after lectures, subject to my availability
  • Other things about the class (e.g., the project): by appointment


Introduction to MDPs and RL

Reinforcement Learning (RL): Applications

[Figure: example application domains, with references [Levine et al'16] [Ng et al'03] [Singh et al'02] [Lei et al'12] [Mandel et al'16] [Tesauro et al'07] [Mnih et al'15] [Silver et al'16].]

Shortest Path

[Figure: a directed graph on states s0, b, c, d, e, f, g; each node is a state, each edge is an action with a cost (the edge costs shown are 1, 2, 3, 2, 1, 3, 14, 1).]

Greedy is suboptimal due to delayed effects; we need long-term planning.

Bellman equation: V*(d) = min{3 + V*(g), 2 + V*(f)}

V*(g) = 0, V*(f) = 1, V*(d) = 2, V*(e) = 2, V*(c) = 4, V*(b) = 5, V*(s0) = 6
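To make the Bellman recursion concrete, here is a minimal sketch of computing V* by repeatedly applying the backup V(s) = min over actions of [cost + V(next state)]. The graph below is a hypothetical stand-in (the figure's exact edges are not fully recoverable from the slides); its costs are chosen so the fixed point matches the values above.

    # Minimal sketch: value iteration on a hypothetical deterministic
    # shortest-path graph (illustrative edge costs, not the slide's figure).
    graph = {                      # state -> list of (cost, next_state)
        "s0": [(1, "b"), (2, "c")],
        "b":  [(1, "c")],
        "c":  [(2, "e")],
        "d":  [(3, "g"), (1, "f")],
        "e":  [(1, "d"), (1, "f")],
        "f":  [(1, "g")],
        "g":  [],                  # goal state: V*(g) = 0
    }

    V = {s: 0.0 for s in graph}
    for _ in range(len(graph)):    # |S| in-place sweeps suffice here
        for s, edges in graph.items():
            if edges:              # Bellman backup: V(s) = min_a [cost + V(s')]
                V[s] = min(cost + V[s2] for cost, s2 in edges)

    print(V)  # {'s0': 6.0, 'b': 5.0, 'c': 4.0, 'd': 2.0, 'e': 2.0, 'f': 1.0, 'g': 0.0}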

Stochastic Shortest Path: Markov Decision Process (MDP)

[Figure: the same graph, with one edge cost changed to 4 and some actions made stochastic: one action's transition distribution puts probability 0.7 on one successor state and 0.3 on another; another puts 0.5 on each of two successors. Nodes are states; edges are actions.]

Greedy is still suboptimal due to delayed effects; we need long-term planning.

Bellman equation: V*(c) = min{4 + 0.7 × V*(d) + 0.3 × V*(e), 2 + V*(e)}

V*(g) = 0, V*(f) = 1, V*(d) = 2, V*(e) = 2.5, V*(c) = 4.5, V*(b) = 6, V*(s0) = 6.5
(✓ marks the actions chosen by the optimal policy π*)
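The same value-iteration sketch extends directly: the backup now averages over the transition distribution before taking the min. Below is a tiny, hypothetical fragment of such an MDP (probabilities and costs made up so the fixed point matches the values above).

    # Sketch: value iteration with stochastic transitions,
    #   V(s) = min_a [ cost(s,a) + sum_{s'} P(s'|s,a) * V(s') ].
    mdp = {  # state -> list of (cost, {next_state: probability})
        "c": [(4, {"d": 0.7, "e": 0.3}), (2, {"e": 1.0})],
        "d": [(2.0, {"g": 1.0})],
        "e": [(2.5, {"g": 1.0})],
        "g": [],  # goal
    }

    V = {s: 0.0 for s in mdp}
    for _ in range(100):  # iterate the backup to (near) convergence
        for s, actions in mdp.items():
            if actions:
                V[s] = min(c + sum(p * V[s2] for s2, p in dist.items())
                           for c, dist in actions)

    print(V["c"])  # min{4 + 0.7*2 + 0.3*2.5, 2 + 2.5} = min{6.15, 4.5} = 4.5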

Stochastic Shortest Path via trial-and-error

Now suppose the transition probabilities are unknown to the learner; they must be discovered by acting in the environment and observing what happens:

Trajectory 1: s0 ↘ c ↗ d → g
Trajectory 2: s0 ↘ c ↗ e → f ↗ g
…

Model-based RL: estimate each transition distribution from the observed trajectories (e.g., empirical frequencies 0.72/0.28 and 0.55/0.45 in place of the true 0.7/0.3 and 0.5/0.5), then plan in the estimated model.

Sample complexity: how many trajectories do we need to compute a near-optimal policy?
• Assume states & actions are visited uniformly
• #trajectories needed ≤ n ⋅ (#state-action pairs), where n is the #samples needed to estimate a multinomial distribution
• Without such uniform coverage, some transitions remain unknown ("?" in the figure): the question becomes nontrivial! Need exploration.
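A minimal sketch of the estimation step in model-based RL: count transitions along observed trajectories and normalize into empirical distributions. The trajectories and action names below are hypothetical.

    from collections import Counter, defaultdict

    # Each trajectory is a list of (state, action, next_state) transitions.
    trajectories = [
        [("s0", "down", "c"), ("c", "up", "d"), ("d", "right", "g")],
        [("s0", "down", "c"), ("c", "up", "e"), ("e", "right", "f"), ("f", "up", "g")],
    ]

    counts = defaultdict(Counter)          # (s, a) -> Counter of next states
    for traj in trajectories:
        for s, a, s_next in traj:
            counts[(s, a)][s_next] += 1

    # Empirical transition distributions P_hat(s'|s,a).
    P_hat = {sa: {s2: n / sum(c.values()) for s2, n in c.items()}
             for sa, c in counts.items()}
    print(P_hat[("c", "up")])  # {'d': 0.5, 'e': 0.5} from these two trajectories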

Video game playing

• state st ∈ S
• action at ∈ A
• reward rt = R(st, at)   (e.g., +20)
• transition dynamics P(⋅ | st, at)   (unknown; e.g., random spawn of enemies)
• policy π: S → A
• objective: maximize E[∑_{t=1}^H rt | π]   (or E[∑_{t=1}^∞ γ^{t-1} rt | π])

[Figure: how much each time step's reward counts under the two objectives, comparing discount factors γ = 0.8 vs γ = 0.9 and horizons H = 5 vs H = 10 over time steps 5, 10, 15.]
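The two objectives differ only in how future rewards are aggregated: a hard cutoff at horizon H versus geometric down-weighting by γ. A tiny worked example on a made-up reward sequence:

    # Finite-horizon vs. discounted return for one (hypothetical) episode.
    rewards = [0, 0, 20, 0, 5, 0, 20]
    H, gamma = 5, 0.9

    finite_horizon = sum(rewards[:H])                              # sum_{t=1}^{H} r_t
    discounted = sum(gamma**t * r for t, r in enumerate(rewards))  # gamma^{t-1} r_t, t zero-indexed here
    print(finite_horizon, round(discounted, 2))                    # 25 30.11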

Video game playing (cont.)

There are far too many states to learn a value for each one separately: we need generalization.

Value function approximation: represent the value as a parametric function f(x; θ) of state features x, with parameters θ.

Find θ s.t. f(x; θ) ≈ r + γ ⋅ E_{x'|x}[ f(x'; θ)]  ⇒  f(⋅; θ) ≈ V*
(x: features of the current state; x': features of the next state; r: the observed reward)
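One standard way to search for such a θ is temporal-difference learning: nudge θ to shrink the gap between f(x; θ) and the bootstrapped target r + γ f(x'; θ) on each observed transition. A minimal sketch with a linear f and synthetic placeholder data (an illustration, not the specific method on the slide):

    import numpy as np

    # TD(0) with linear function approximation f(x; theta) = theta @ x.
    rng = np.random.default_rng(0)
    dim, gamma, lr = 4, 0.9, 0.05
    theta = np.zeros(dim)
    # Hypothetical dataset of (x, r, x') transitions; real features would
    # come from the game screen.
    transitions = [(rng.normal(size=dim), rng.normal(), rng.normal(size=dim))
                   for _ in range(1000)]

    for x, r, x_next in transitions:
        td_error = r + gamma * theta @ x_next - theta @ x  # Bellman residual at x
        theta += lr * td_error * x                         # semi-gradient update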

Adaptive medical treatment

• State: diagnosis
• Action: treatment
• Reward: progress in recovery


A Machine Learning view of RL

Many Faces of Reinforcement Learning   (slide credit: David Silver)

[Figure: RL at the intersection of many fields: Computer Science (machine learning), Engineering (optimal control), Mathematics (operations research), Economics (bounded rationality), Psychology (classical/operant conditioning), and Neuroscience (reward system).]

Supervised Learning

Given {(x(i), y(i))}, learn f : x ↦ y
• Online version: for round t = 1, 2, …, the learner
  • observes x(t)
  • predicts ŷ(t)
  • receives y(t)
• Want to maximize the # of correct predictions
• e.g., classify whether an image shows a dog, a cat, a plane, etc. (multi-class classification)
• The dataset is fixed for everyone
• "Full information setting"
• Core challenge: generalization
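A minimal sketch of the online protocol above, with a toy majority-vote learner on made-up data; the point is only the observe/predict/receive loop, and that the true label is always revealed (full information):

    from collections import Counter

    stream = [("furry", "cat"), ("furry", "dog"), ("wings", "plane"),
              ("furry", "dog"), ("wings", "plane")]   # (x(t), y(t)) pairs

    seen = Counter()   # (feature, label) -> count observed so far
    correct = 0
    for x, y in stream:
        votes = {lbl: n for (feat, lbl), n in seen.items() if feat == x}
        y_hat = max(votes, key=votes.get) if votes else "dog"  # predict y_hat(t)
        correct += (y_hat == y)
        seen[(x, y)] += 1   # full information: the true y(t) is revealed
    print(correct, "/", len(stream))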

Contextual bandits

For round t = 1, 2, …, the learner
• given x(t), chooses from a set of actions a(t) ∈ A
• receives reward r(t) ~ R(x(t), a(t)) (i.e., the reward can be random)
• Want to maximize total reward
• You generate your own dataset {(x(t), a(t), r(t))}!
• e.g., for an image, the learner guesses a label and is told whether it is correct (reward = 1 if correct, 0 otherwise); it never observes the true label.
• e.g., for a user, the website recommends a movie and observes whether the user likes it; it never observes which movies the user really wants to see.
• "Partial information setting"
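A minimal sketch of this protocol with an ε-greedy learner on a made-up environment (hypothetical contexts, actions, and rewards); note the learner only ever sees the reward of the action it chose:

    import random
    from collections import defaultdict

    contexts, actions, eps = ["img1", "img2"], ["dog", "cat", "plane"], 0.1
    true_label = {"img1": "dog", "img2": "plane"}
    pull = lambda x, a: 1.0 if a == true_label[x] else 0.0  # unknown R(x, a)

    total = defaultdict(float)   # (x, a) -> cumulative reward
    count = defaultdict(int)     # (x, a) -> #times tried

    for t in range(10_000):
        x = random.choice(contexts)
        if random.random() < eps:    # explore
            a = random.choice(actions)
        else:                        # exploit the best empirical mean so far
            a = max(actions, key=lambda a: total[x, a] / max(count[x, a], 1))
        r = pull(x, a)               # partial information: reward for a only
        total[x, a] += r
        count[x, a] += 1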

Contextual Bandits (cont.)

• Simplification: no x → Multi-Armed Bandits (MAB)
• Bandits is a research area in itself. We will not do a lot of bandits, but may go through some material that has important implications for general RL (e.g., lower bounds).

RL

For round t = 1, 2, …,
• For time step h = 1, 2, …, H, the learner
  • observes xh(t)
  • chooses ah(t)
  • receives rh(t) ~ R(xh(t), ah(t))
  • the next xh+1(t) is generated as a function of xh(t) and ah(t) (or sometimes, of all previous x's and a's within round t)
• Bandits + "delayed rewards/consequences"
• The protocol here is for episodic RL (each t is an episode).
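The episodic protocol written as a loop, with hypothetical env/agent interfaces standing in for the environment and the learner (any concrete algorithm would live inside agent.act and agent.update):

    # Sketch of episodic RL: T rounds (episodes), each an H-step interaction.
    def run(env, agent, T, H):
        for t in range(T):                     # round t is one episode
            x = env.reset()                    # observe x_1
            for h in range(H):
                a = agent.act(x, h)            # choose a_h
                r, x_next = env.step(a)        # receive r_h ~ R(x_h, a_h) and the next state
                agent.update(x, a, r, x_next)  # learn only from the learner's own data
                x = x_next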

Why statistical RL?

Two types of scenarios in RL research:

1. Solving a large planning problem using a learning approach
   - e.g., AlphaGo, video game playing, simulated robotics
   - Transition dynamics (Go rules) known, but too many states
   - Run the simulator to collect data

2. Solving a learning problem
   - e.g., adaptive medical treatment
   - Transition dynamics unknown (and too many states)
   - Interact with the environment to collect data

• I am more interested in #2, which is more challenging in some aspects.

• Data (real-world interactions) is the highest priority; computation comes second.

• Even for #1, sample complexity lower-bounds computational complexity, so a statistical-first approach is reasonable.
  - Caveat to this argument: you can do a lot more in a simulator; see http://hunch.net/?p=8825714

MDP Planning

Infinite-horizon discounted MDPs

An MDP M = (S, A, P, R, γ):

• State space S.

• Action space A.

• Transition function P: S×A→∆(S), where ∆(S) is the probability simplex over S, i.e., all non-negative vectors of length |S| that sum up to 1.

• Reward function R: S×A→ℝ (a deterministic reward function).

• Discount factor γ ∈ [0, 1).

• The agent starts in some state s1, takes action a1, receives reward r1 = R(s1, a1), transitions to s2 ~ P(s1, a1), takes action a2, and so on; the process continues indefinitely.

We will only consider discrete and finite spaces in this course.
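As a concrete picture of the tuple (S, A, P, R, γ), a minimal sketch with numpy arrays; the array names and the random instance are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

# P[s, a] is a point in Delta(S): non-negative, sums to 1 over next states.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)

# R[s, a] is the deterministic reward, here drawn once in [0, 1].
R = rng.random((n_states, n_actions))

def step(s, a):
    """One step of the process: reward R(s, a), then s' ~ P(s, a)."""
    return R[s, a], rng.choice(n_states, p=P[s, a])
```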

Value and policy

• Want to take actions in a way that maximizes value (or return): 𝔼[∑_{t=1}^∞ γ^{t−1} r_t].

• This value depends on where you start and how you act.

• Often assume boundedness of rewards: r_t ∈ [0, Rmax].

• What's the range of 𝔼[∑_{t=1}^∞ γ^{t−1} r_t]? It is [0, Rmax/(1−γ)]; a one-line check follows this list.

• A (deterministic) policy π: S→A describes how the agent acts: at state s_t, it chooses action a_t = π(s_t).

• More generally, the agent may choose actions randomly (π: S→∆(A)), or even in a way that varies across time steps (“non-stationary policies”).

• Define Vπ(s) = 𝔼[∑_{t=1}^∞ γ^{t−1} r_t | s_1 = s, π].
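The one-line check promised above, using only r_t ∈ [0, Rmax] and the geometric series:

\[
0 \;\le\; \mathbb{E}\Big[\sum_{t=1}^{\infty} \gamma^{t-1} r_t\Big]
\;\le\; R_{\max} \sum_{t=1}^{\infty} \gamma^{t-1}
\;=\; \frac{R_{\max}}{1-\gamma}.
\]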

Bellman equation for policy evaluation

Unrolling the sum one step and using the Markov property:

\begin{align*}
V^\pi(s) &= \mathbb{E}\Big[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \;\Big|\; s_1 = s, \pi\Big] \\
&= \mathbb{E}\Big[r_1 + \sum_{t=2}^{\infty} \gamma^{t-1} r_t \;\Big|\; s_1 = s, \pi\Big] \\
&= R(s, \pi(s)) + \sum_{s' \in S} P(s'|s, \pi(s))\, \mathbb{E}\Big[\gamma \sum_{t=2}^{\infty} \gamma^{t-2} r_t \;\Big|\; s_1 = s, s_2 = s', \pi\Big] \\
&= R(s, \pi(s)) + \sum_{s' \in S} P(s'|s, \pi(s))\, \mathbb{E}\Big[\gamma \sum_{t=2}^{\infty} \gamma^{t-2} r_t \;\Big|\; s_2 = s', \pi\Big] \\
&= R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi(s))\, \mathbb{E}\Big[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \;\Big|\; s_1 = s', \pi\Big] \\
&= R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi(s))\, V^\pi(s') \\
&= R(s, \pi(s)) + \gamma \big\langle P(\cdot\,|\,s, \pi(s)),\, V^\pi(\cdot) \big\rangle
\end{align*}
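The last line is also an algorithm: iterating the backup V ← Rπ + γPπV converges to Vπ, because the backup is a γ-contraction in the sup norm. A minimal sketch, assuming Pπ and Rπ are given as numpy arrays (illustrative names, not course code):

```python
import numpy as np

def evaluate_policy(P_pi, R_pi, gamma, tol=1e-10):
    """Fixed-point iteration on V = R_pi + gamma * P_pi @ V.

    P_pi[s, s'] = P(s'|s, pi(s)) (row-stochastic); R_pi[s] = R(s, pi(s)).
    """
    V = np.zeros_like(R_pi, dtype=float)
    while True:
        V_new = R_pi + gamma * P_pi @ V   # one Bellman backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```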

Matrix form: define

• Vπ as the |S|×1 vector [Vπ(s)]_{s∈S}

• Rπ as the |S|×1 vector [R(s, π(s))]_{s∈S}

• Pπ as the |S|×|S| matrix [P(s′|s, π(s))]_{s∈S, s′∈S}

Then the Bellman equation Vπ(s) = R(s, π(s)) + γ⟨P(·|s, π(s)), Vπ(·)⟩ becomes

Vπ = Rπ + γPπVπ

(I − γPπ)Vπ = Rπ

Vπ = (I − γPπ)^{−1}Rπ

The matrix (I − γPπ) is always invertible. Proof? (A sketch follows.)
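One standard answer, sketched here since the slide leaves it as a question: Pπ is row-stochastic, so ‖γPπ‖∞ = γ < 1 and the Neumann series converges:

\[
(I - \gamma P^{\pi}) \sum_{t=0}^{T} (\gamma P^{\pi})^{t}
\;=\; I - (\gamma P^{\pi})^{T+1} \;\longrightarrow\; I
\quad \text{as } T \to \infty,
\]

because \(\|(\gamma P^{\pi})^{T+1}\|_\infty \le \gamma^{T+1} \to 0\). Hence \((I - \gamma P^{\pi})^{-1} = \sum_{t=0}^{\infty} (\gamma P^{\pi})^{t}\) exists.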

State occupancy

Consider (I − γPπ)^{−1}. Each row (indexed by s) is the discounted state occupancy dπs, whose (s′)-th entry is

dπs(s′) = 𝔼[∑_{t=1}^∞ γ^{t−1} 𝕀[s_t = s′] | s_1 = s, π]

• Each row is like a distribution vector, except that the entries sum up to 1/(1−γ). Let ηπs = (1 − γ)dπs denote the normalized vector.

• Vπ(s) is the dot product between dπs and the reward vector Rπ.

• dπs(s′) can also be interpreted as the value function of π at s under the indicator reward function 𝕀[· = s′].
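A quick numerical sanity check of these facts, on an illustrative random instance (not course code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 4, 0.9
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)      # row-stochastic P^pi
R_pi = rng.random(n)

D = np.linalg.inv(np.eye(n) - gamma * P_pi)  # row s is the occupancy d_s^pi
V = D @ R_pi                                 # V^pi(s) = <d_s^pi, R^pi>

assert np.allclose(D.sum(axis=1), 1 / (1 - gamma))  # rows sum to 1/(1-gamma)
eta = (1 - gamma) * D                               # normalized occupancies
assert np.allclose(eta.sum(axis=1), 1.0)
```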

Optimality

• For infinite-horizon discounted MDPs, there always exists a stationary and deterministic policy that is optimal for all starting states simultaneously.
  - Proof: Puterman ’94, Thm 6.2.7 (reference due to Shipra Agrawal).

• Let π* denote this optimal policy, and V* := Vπ*.

• Bellman Optimality Equation:

V*(s) = max_{a∈A} ( R(s, a) + γ 𝔼_{s′∼P(s,a)}[V*(s′)] )

• If we know V*, how do we get π*?

• It is easier to work with Q-values Q*(s, a), as

π*(s) = argmax_{a∈A} Q*(s, a), where Q*(s, a) = R(s, a) + γ 𝔼_{s′∼P(s,a)}[max_{a′∈A} Q*(s′, a′)].
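The Q-form of the optimality equation suggests the obvious algorithm: iterate the optimality backup and act greedily at the end. A minimal value-iteration sketch under the same illustrative array conventions as before (P of shape |S|×|A|×|S|, R of shape |S|×|A|); this is one standard way to compute Q*, not something prescribed by the slides.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10):
    """Iterate Q <- R + gamma * E_{s'~P}[max_a' Q(s', a')] to a fixed point."""
    Q = np.zeros_like(R, dtype=float)
    while True:
        V = Q.max(axis=1)            # V*(s) = max_a Q*(s, a)
        Q_new = R + gamma * (P @ V)  # P @ V has shape |S| x |A|
        if np.max(np.abs(Q_new - Q)) < tol:
            break
        Q = Q_new
    return Q, Q.argmax(axis=1)       # Q* and the greedy policy pi*
```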

Ad Hoc Homework 1

• uploaded on the course website

• helps you understand the relationships between alternative MDP formulations

• more like readings with questions to think about

• no need to submit

• we will go through it in class next week