
CS 598 Statistical Reinforcement Learning

Nan Jiang

Overview


What’s this course about?

• A grad-level seminar course on the theory of RL
• with a focus on sample complexity analyses
• all about proofs, some perspectives, zero implementation
• A seminar course can be anywhere between students presenting papers and a rigorous course
• In this course I will deliver most (or all) of the lectures
• No textbook; the material is created by myself (course notes)
• I will share slides/notes on the course website


Who should take this course?

• This course will be a good fit for you if you…
• are interested in understanding RL mathematically
• want to do research in the theory of RL or a related area
• want to work with me
• are comfortable with maths

• This course will not be a good fit if you…
• are mostly interested in implementing RL algorithms and making things work; we won't cover any engineering tricks (which are essential to, e.g., deep RL). Check other RL courses offered across the campus.


Prerequisites

• Maths
• Linear algebra, probability & statistics, basic calculus
• Markov chains
• Optional: stochastic processes, numerical analysis
• Useful for research: TCS background, empirical processes and statistical learning theory, optimization, online learning
• Exposure to ML
• e.g., CS 446 Machine Learning
• Experience with RL


Coursework

• Some readings before/after class
• Ad hoc homework to help digest particular material. Deadlines will be lenient and announced at the time of assignment
• Main assignment: course project (work on your own)
• Baseline: reproduce the theoretical analysis in existing papers
• Advanced: identify an interesting/challenging extension to the paper and explore the novel research question yourself
• Or, just work on a novel research question (it must have a significant theoretical component; you need to discuss it with me)


Course project (cont.)

• See the list of references and potential topics on the website
• You will need to submit:
• A brief proposal (~1/2 page). Tentative deadline: end of February
• what the topic is and which papers you plan to work on
• why you chose the topic: what interests you?
• which aspect(s) you will focus on
• Final report: clarity, precision, and brevity are greatly valued. More details to come…
• All documents should be in PDF. The final report should be prepared using LaTeX.


Course project (cont. 2)

Rule of thumb:
1. learn something that interests you
2. teach me something! (I wouldn't learn if I could not understand your report due to lack of clarity)
3. write a report similar to (or better than!) the notes I will share with you


Contents of the course

• Many important topics in RL will not be covered in depth (e.g., TD); read more if you want to get a more comprehensive view of RL
• The other opportunity to learn what's not covered in lectures is the project, as the potential topics for projects are much broader than what's covered in class.


My goals

• Encourage you to do research in this area
• Introduce useful mathematical tools to you
• Learn from you


Logistics

• Course website: http://nanjiang.cs.illinois.edu/cs598/
• logistics, links to slides/notes (uploaded after lectures), resources (e.g., textbooks to consult, related courses), deadlines & announcements
• Time & location: Tue & Thu 12:30-1:45pm, 1304 Siebel.
• TA: Jinglin Chen (jinglinc)
• Office hours: by appointment, 3322 Siebel.
• Questions about the material: ad hoc meetings after lectures, subject to my availability
• Other things about the class (e.g., the project): by appointment


Introduction to MDPs and RL


Reinforcement Learning (RL): Applications

[Figure: a collage of application examples, with citations: Levine et al. '16, Ng et al. '03, Singh et al. '02, Lei et al. '12, Mandel et al. '16, Tesauro et al. '07, Mnih et al. '15, Silver et al. '16.]

Shortest Path

[Figure: a directed graph with nodes s0, b, c, d, e, f, g and edge costs; the nodes are labeled "state" and the edges "action".]

• Greedy is suboptimal due to delayed effects; need long-term planning
• Bellman equation: V*(d) = min{ 3 + V*(g), 2 + V*(f) }
• V*(g) = 0, V*(f) = 1, V*(d) = 2, V*(e) = 2, V*(c) = 4, V*(b) = 5, V*(s0) = 6
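For concreteness, here is a minimal sketch of that backward Bellman computation in Python. The graph below is made up for illustration; the exact edge costs of the slide's figure are not recoverable from this transcript, so the resulting numbers are not meant to match the values above.

```python
# A made-up directed graph: graph[s] maps each successor s2 to the edge cost.
graph = {
    "s0": {"b": 1, "c": 2},
    "b":  {"d": 4},
    "c":  {"d": 4, "e": 2},
    "d":  {"f": 2, "g": 3},
    "e":  {"f": 1},
    "f":  {"g": 1},
    "g":  {},                      # goal state
}

def optimal_values(graph, goal="g"):
    """Bellman equation V*(s) = min_{s2} [cost(s, s2) + V*(s2)], with V*(goal) = 0."""
    V = {s: float("inf") for s in graph}
    V[goal] = 0.0
    for _ in range(len(graph)):    # enough sweeps for an acyclic graph of this size
        for s, edges in graph.items():
            if edges:              # skip the goal (no outgoing edges)
                V[s] = min(cost + V[s2] for s2, cost in edges.items())
    return V

print(optimal_values(graph))       # for this toy graph: V*(g)=0, V*(f)=1, V*(e)=2, ...
```

Because this toy graph is acyclic, a fixed number of sweeps suffices; on graphs with cycles one would instead iterate until the values stop changing.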

Stochastic Shortest Path

[Figure: a variant of the graph (one edge now has cost 4), where some actions have stochastic outcomes with transition distributions such as (0.7, 0.3) and (0.5, 0.5) over the possible next states; the annotations read "state", "action", and "transition distribution".]

• This is a Markov Decision Process (MDP).

Stochastic Shortest Path

[Figure: the stochastic graph again; checkmarks mark an action at each state and one state is marked "?".]

• Greedy is suboptimal due to delayed effects; need long-term planning
• Bellman equation: V*(c) = min{ 4 + 0.7 × V*(d) + 0.3 × V*(e), 2 + V*(e) }
• V*(g) = 0, V*(f) = 1, V*(d) = 2, V*(e) = 2.5, V*(c) = 4.5, V*(b) = 6, V*(s0) = 6.5
• ✓: optimal policy π*

Stochastic Shortest Path, via trial-and-error

[Figure: the same stochastic graph; the transition probabilities are not known in advance, so the agent collects trajectories by interacting with the environment.]

• Trajectory 1: s0 ↘ c ↗ d → g
• Trajectory 2: s0 ↘ c ↗ e → f ↗ g

Stochastic Shortest Path, via trial-and-error (Model-based RL)

[Figure: the graph with the transition probabilities replaced by empirical estimates from the collected trajectories, e.g., (0.72, 0.28) and (0.55, 0.45) in place of the true (0.7, 0.3) and (0.5, 0.5).]

• Model-based RL: estimate the transition distributions from the trajectories, then plan with the estimated model
• How many trajectories do we need to compute a near-optimal policy? (sample complexity)

Stochastic Shortest Path, via trial-and-error

[Figure: the graph with the unknown transition probabilities marked "?".]

How many trajectories do we need to compute a near-optimal policy?

• Assume states & actions are visited uniformly
• #trajectories needed ≤ n ⋅ (#state-action pairs), where n is the #samples needed to estimate a multinomial distribution
• Nontrivial! Need exploration
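The model-based step above boils down to estimating one multinomial distribution per state-action pair from counts. A minimal sketch, with made-up trajectory data and state/action names (the slides do not specify a data format):

```python
from collections import defaultdict

# Estimate each transition distribution P(. | s, a) by empirical counts over
# the collected trajectories; each trajectory is a list of (s, a, s') triples.
def estimate_transitions(trajectories):
    counts = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:
        for s, a, s_next in traj:
            counts[(s, a)][s_next] += 1
    P_hat = {}
    for (s, a), next_counts in counts.items():
        total = sum(next_counts.values())
        P_hat[(s, a)] = {s2: c / total for s2, c in next_counts.items()}
    return P_hat

# Made-up data: with more visits to (c, "up"), the empirical distribution over
# next states would approach something like the (0.72, 0.28) on the slide.
trajs = [[("c", "up", "d")], [("c", "up", "e")], [("c", "up", "d")], [("c", "up", "d")]]
print(estimate_transitions(trajs))   # {('c', 'up'): {'d': 0.75, 'e': 0.25}}
```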

Video game playing

[Figure: a game screen annotated with the MDP ingredients below; small plots compare discount factors γ = 0.8 vs. γ = 0.9 and horizons H = 5 vs. H = 10.]

• state s_t ∈ S
• action a_t ∈ A
• reward r_t = R(s_t, a_t), e.g., +20
• transition dynamics P(⋅ | s_t, a_t), which are unknown; e.g., the random spawn of enemies
• policy π: S → A
• objective: maximize E[ ∑_{t=1}^H r_t | π ] (or E[ ∑_{t=1}^∞ γ^{t-1} r_t | π ])

Video game playing

• Need generalization

Value function approximation

[Figure: state features x are fed into a parametric function f(x; θ) with parameters θ; a plot shows the approximated value against the state features x, and transitions x → x' with reward r serve as training data.]

• Find θ s.t. f(x; θ) ≈ r + γ ⋅ E_{x'|x}[ f(x'; θ) ]  ⇒  f(⋅; θ) ≈ V*
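A minimal sketch of that idea with a linear approximator f(x; θ) = θᵀx, fit by repeatedly regressing on bootstrapped targets r + γ·f(x'; θ). The slide only states the fixed-point condition; this particular iterative least-squares procedure and the synthetic transition data below are assumptions made for illustration.

```python
import numpy as np

# Fit theta so that f(x; theta) = x @ theta approximately satisfies
#     f(x; theta) ≈ r + gamma * f(x'; theta)
# on logged transitions (x, r, x'). The data here is synthetic.
rng = np.random.default_rng(0)
gamma, d, n = 0.9, 4, 1000
X = rng.normal(size=(n, d))                          # state features x
X_next = 0.8 * X + 0.1 * rng.normal(size=(n, d))     # features of the next state x'
r = X @ np.array([1.0, -0.5, 0.2, 0.0]) + 0.01 * rng.normal(size=n)

theta = np.zeros(d)
for _ in range(200):
    targets = r + gamma * X_next @ theta             # bootstrapped regression targets
    theta, *_ = np.linalg.lstsq(X, targets, rcond=None)

print("fitted theta:", theta)                        # f(x; theta) approximates the value
```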

Adaptive medical treatment

• State: diagnosis
• Action: treatment
• Reward: progress in recovery

A Machine Learning view of RL


[Figure: "Many Faces of Reinforcement Learning", a diagram relating Computer Science (Machine Learning), Engineering (Optimal Control), Mathematics (Operations Research), Economics (Bounded Rationality), Psychology (Classical/Operant Conditioning), and Neuroscience (Reward System), with Reinforcement Learning at their intersection. Slide credit: David Silver, "Lecture 1: Introduction to Reinforcement Learning", About RL.]


Supervised Learning

Given {(x(i), y(i))}, learn f: x ↦ y

• Online version: for round t = 1, 2, …, the learner
• observes x(t)
• predicts ŷ(t)
• receives y(t)
• Want to maximize the # of correct predictions
• e.g., classify whether an image is of a dog, a cat, a plane, etc. (multi-class classification)
• Dataset is fixed for everyone
• "Full information setting"
• Core challenge: generalization


Contextual bandits

For round t = 1, 2, …, the learner
• Given x(t), chooses an action a(t) from a set of actions A
• Receives reward r(t) ~ R(x(t), a(t)) (i.e., the reward can be random)
• Want to maximize total reward
• You generate your own dataset {(x(t), a(t), r(t))}!
• e.g., for an image, the learner guesses a label and is told whether the guess is correct or not (reward = 1 if correct and 0 otherwise); it never learns the true label.
• e.g., for a user, the website recommends a movie and observes whether the user likes it or not; it never learns which movies the user really wants to see.
• "Partial information setting"
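A minimal simulation of this protocol using the slide's label-guessing example; the uniformly random "learner" below is only a placeholder policy, not an algorithm from the lecture.

```python
import random

# The learner sees a context, picks a label (the "action"), and only observes a
# 0/1 reward; the true label stays hidden. All details here are made up.
labels = ["dog", "cat", "plane"]
random.seed(0)

dataset, total_reward = [], 0
for t in range(1000):
    x_t = random.random()                          # stand-in for image features
    true_label = labels[int(x_t * len(labels))]    # hidden from the learner
    a_t = random.choice(labels)                    # learner's guess
    r_t = 1 if a_t == true_label else 0            # bandit feedback only
    dataset.append((x_t, a_t, r_t))                # the dataset the learner generates
    total_reward += r_t

print("average reward:", total_reward / 1000)      # about 1/3 for random guessing
```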


Contextual Bandits (cont.)

• Simplification: no x → Multi-Armed Bandits (MAB)
• Bandits are a research area by themselves. We will not do a lot of bandits, but may go through some material that has important implications for general RL (e.g., lower bounds)


RL

For round t = 1, 2, …,
• For time step h = 1, 2, …, H, the learner
• Observes x_h(t)
• Chooses a_h(t)
• Receives r_h(t) ~ R(x_h(t), a_h(t))
• Next x_{h+1}(t) is generated as a function of x_h(t) and a_h(t) (or sometimes, all previous x's and a's within round t)
• Bandits + "delayed rewards/consequences"
• The protocol here is for episodic RL (each t is an episode).
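A minimal, runnable sketch of this episodic protocol. The toy environment and the random learner are hypothetical placeholders; the slides specify only the interaction protocol, not these components.

```python
import random

class ToyEnv:
    """A made-up environment with integer states 0..4 and binary actions."""
    def reset(self):
        self.x = 0
        return self.x
    def step(self, a):                         # reward and next state depend on (x, a)
        r = 1.0 if a == self.x % 2 else 0.0
        self.x = random.randint(0, 4)
        return r, self.x

class RandomLearner:
    def choose(self, x):
        return random.randint(0, 1)
    def update(self, x, a, r, x_next):
        pass                                   # a real learner would improve here

random.seed(0)
env, learner = ToyEnv(), RandomLearner()
total = 0.0
for t in range(100):                           # each round t is one episode
    x = env.reset()                            # observe x_1
    for h in range(10):                        # time steps h = 1, ..., H
        a = learner.choose(x)                  # choose a_h
        r, x_next = env.step(a)                # receive r_h ~ R(x_h, a_h), see x_{h+1}
        learner.update(x, a, r, x_next)
        x = x_next
        total += r
print("average per-episode return:", total / 100)
```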


Why statistical RL?

Two types of scenarios in RL research:

1. Solving a large planning problem using a learning approach
- e.g., AlphaGo, video game playing, simulated robotics
- Transition dynamics (the rules of Go) known, but too many states
- Run the simulator to collect data

2. Solving a learning problem
- e.g., adaptive medical treatment
- Transition dynamics unknown (and too many states)
- Interact with the environment to collect data


Why statistical RL?

Two types of scenarios in RL research:
1. Solving a large planning problem using a learning approach
2. Solving a learning problem

• I am more interested in #2; it is more challenging in some aspects.
• Data (real-world interactions) is the highest priority; computation comes second.
• Even for #1, sample complexity lower-bounds computational complexity, so a statistical-first approach is reasonable.
• Caveat to this argument: you can do a lot more in a simulator; see http://hunch.net/?p=8825714

MDP Planning


Infinite-horizon discounted MDPs

An MDP M = (S, A, P, R, γ):

• State space S.
• Action space A.
• Transition function P: S×A → Δ(S), where Δ(S) is the probability simplex over S, i.e., all non-negative vectors of length |S| that sum up to 1.
• Reward function R: S×A → ℝ (a deterministic reward function).
• Discount factor γ ∈ [0, 1).
• The agent starts in some state s1, takes action a1, receives reward r1 ~ R(s1, a1), transitions to s2 ~ P(s1, a1), takes action a2, and so on; the process continues indefinitely.

We will only consider discrete and finite spaces in this course.
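A minimal tabular encoding of such an MDP and a few steps of the interaction just described; the 2-state, 2-action numbers are made up for illustration and are not an MDP from the slides.

```python
import numpy as np

# M = (S, A, P, R, gamma): P[s, a] is a distribution over next states, R[s, a] a reward.
num_S, num_A, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],       # P[s, a, s'] sums to 1 over s'
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

rng = np.random.default_rng(0)
s = 0                                          # start state s_1
for t in range(5):                             # the real process continues indefinitely
    a = rng.integers(num_A)                    # here: a uniformly random action
    r = R[s, a]                                # deterministic reward R(s, a)
    s = rng.choice(num_S, p=P[s, a])           # next state ~ P(. | s, a)
    print(f"t={t + 1}: a={a}, r={r}, next s={s}")
```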


Value and policy

• Want to take actions in a way that maximizes value (or return): E[ ∑_{t=1}^∞ γ^{t-1} r_t ]
• This value depends on where you start and how you act
• Often assume boundedness of rewards: r_t ∈ [0, R_max]
• What's the range of E[ ∑_{t=1}^∞ γ^{t-1} r_t ]? It lies in [0, R_max / (1 − γ)]
• A (deterministic) policy π: S → A describes how the agent acts: at state s_t, it chooses action a_t = π(s_t).
• More generally, the agent may choose actions randomly (π: S → Δ(A)), or even in a way that varies across time steps ("non-stationary policies")
• Define V^π(s) = E[ ∑_{t=1}^∞ γ^{t-1} r_t | s_1 = s, π ]

Bellman equation for policy evaluation

V^π(s) = E[ ∑_{t=1}^∞ γ^{t-1} r_t | s_1 = s, π ]
       = E[ r_1 + ∑_{t=2}^∞ γ^{t-1} r_t | s_1 = s, π ]
       = R(s, π(s)) + ∑_{s'∈S} P(s'|s, π(s)) E[ γ ∑_{t=2}^∞ γ^{t-2} r_t | s_1 = s, s_2 = s', π ]
       = R(s, π(s)) + ∑_{s'∈S} P(s'|s, π(s)) E[ γ ∑_{t=2}^∞ γ^{t-2} r_t | s_2 = s', π ]   (Markov property)
       = R(s, π(s)) + γ ∑_{s'∈S} P(s'|s, π(s)) E[ ∑_{t=1}^∞ γ^{t-1} r_t | s_1 = s', π ]
       = R(s, π(s)) + γ ∑_{s'∈S} P(s'|s, π(s)) V^π(s')
       = R(s, π(s)) + γ ⟨ P(⋅|s, π(s)), V^π(⋅) ⟩


Bellman equation for policy evaluation

Matrix form: define
• V^π as the |S|×1 vector [V^π(s)]_{s∈S}
• R^π as the vector [R(s, π(s))]_{s∈S}
• P^π as the matrix [P(s'|s, π(s))]_{s∈S, s'∈S}

Then V^π(s) = R(s, π(s)) + γ ⟨ P(⋅|s, π(s)), V^π(⋅) ⟩ becomes

V^π = R^π + γ P^π V^π
(I − γP^π) V^π = R^π
V^π = (I − γP^π)^{-1} R^π

(I − γP^π) is always invertible. Proof?
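A minimal numerical sketch of this matrix form: solve (I − γP^π)V^π = R^π directly. The 3-state P^π and R^π below are made up for illustration.

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.1, 0.9, 0.0],          # row s = P(. | s, pi(s))
                 [0.0, 0.2, 0.8],
                 [0.5, 0.0, 0.5]])
R_pi = np.array([0.0, 1.0, 2.0])           # R(s, pi(s))

V_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)
print(V_pi)

# Sanity check of the Bellman equation V_pi = R_pi + gamma * P_pi V_pi:
assert np.allclose(V_pi, R_pi + gamma * P_pi @ V_pi)
```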


State occupancy

Each row (indexed by s) of (I − γP^π)^{-1} is the discounted state occupancy d_s^π, whose (s')-th entry is

d_s^π(s') = E[ ∑_{t=1}^∞ γ^{t-1} 𝕀[s_t = s'] | s_1 = s, π ]

• Each row is like a distribution vector, except that the entries sum up to 1/(1−γ). Let η_s^π = (1−γ) d_s^π denote the normalized vector.
• V^π(s) is the dot product between d_s^π and the reward vector.
• d_s^π(s') can also be interpreted as a value function under the indicator reward function 𝕀[⋅ = s'].
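A small numerical check of these occupancy facts, reusing the same kind of made-up P^π and R^π as in the previous sketch.

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.1, 0.9, 0.0],
                 [0.0, 0.2, 0.8],
                 [0.5, 0.0, 0.5]])
R_pi = np.array([0.0, 1.0, 2.0])

D = np.linalg.inv(np.eye(3) - gamma * P_pi)        # row s = discounted occupancy d_s^pi
assert np.allclose(D.sum(axis=1), 1.0 / (1.0 - gamma))   # entries sum to 1/(1 - gamma)
assert np.allclose(D @ R_pi,                              # V^pi(s) = <d_s^pi, R^pi>
                   np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi))
print("occupancy checks passed")
```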


Optimality

• For infinite-horizon discounted MDPs, there always exists a stationary and deterministic policy that is optimal for all starting states simultaneously
• Proof: Puterman '94, Thm 6.2.7 (reference due to Shipra Agrawal)
• Let π* denote this optimal policy, and V* := V^{π*}
• Bellman Optimality Equation: V*(s) = max_{a∈A} ( R(s, a) + γ E_{s'∼P(s,a)}[ V*(s') ] )
• If we know V*, how do we get π*?
• Easier to work with Q-values Q*(s, a), since π*(s) = argmax_{a∈A} Q*(s, a), where

Q*(s, a) = R(s, a) + γ E_{s'∼P(s,a)}[ max_{a'∈A} Q*(s', a') ]
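One standard way to compute Q* from the Bellman optimality equation is value iteration: repeatedly apply the optimality backup until the iterates stop changing. The slide states the equation rather than this algorithm, and the small MDP below is made up for illustration.

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],       # P[s, a, s']
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                    # R[s, a]

Q = np.zeros_like(R)
for _ in range(1000):
    V = Q.max(axis=1)                         # V*(s') = max_a' Q*(s', a')
    Q_new = R + gamma * P @ V                 # Bellman optimality backup
    if np.max(np.abs(Q_new - Q)) < 1e-10:
        break
    Q = Q_new

pi_star = Q.argmax(axis=1)                    # greedy policy: pi*(s) = argmax_a Q*(s, a)
print("Q*:", Q, "pi*:", pi_star, sep="\n")
```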


Ad Hoc Homework 1

• uploaded on the course website
• helps you understand the relationships between alternative MDP formulations
• more like readings with questions to think about
• no need to submit
• we will go through it in class next week