CS 598 Statistical Reinforcement Learning

Nan Jiang


Overview

What's this course about?

• A grad-level seminar course on the theory of RL
  • with a focus on sample complexity analyses
  • all about proofs, some perspectives, zero implementation
• A seminar course can be anywhere between students presenting papers and a rigorous course
  • In this course I will deliver most (or all) of the lectures
• No textbook; the material is created by myself (course notes)
  • I will share slides/notes on the course website

Who should take this course?

• This course will be a good fit for you if you…
  • are interested in understanding RL mathematically
  • want to do research in the theory of RL or a related area
  • want to work with me
  • are comfortable with math

• This course will not be a good fit if you…
  • are mostly interested in implementing RL algorithms and making things work: we won't cover any engineering tricks (which are essential to, e.g., deep RL); check the other RL courses offered across campus

Prerequisites

• Math
  • Linear algebra, probability & statistics, basic calculus
  • Markov chains
  • Optional: stochastic processes, numerical analysis
  • Useful for research: TCS background, empirical processes and statistical learning theory, optimization, online learning
• Exposure to ML
  • e.g., CS 446 Machine Learning
• Experience with RL

Coursework

• Some readings before/after class
• Ad hoc homework to help you digest particular material. Deadlines will be lenient and announced at the time of assignment
• Main assignment: course project (work on your own)
  • Baseline: reproduce the theoretical analysis in existing papers
  • Advanced: identify an interesting/challenging extension to the paper and explore the novel research question yourself
  • Or, just work on a novel research question (it must have a significant theoretical component; discuss it with me first)

Course project (cont.)

• See the list of references and potential topics on the website
• You will need to submit:
  • A brief proposal (~1/2 page). Tentative deadline: end of February
    • what the topic is and which papers you plan to work on
    • why you chose the topic: what interests you?
    • which aspect(s) you will focus on
  • A final report: clarity, precision, and brevity are greatly valued. More details to come…
• All documents should be in PDF. The final report should be prepared using LaTeX.

Course project (cont. 2)

Rule of thumb:
1. learn something that interests you
2. teach me something! (I wouldn't learn anything if I could not understand your report due to lack of clarity)
3. write a report similar to (or better than!) the notes I will share with you

Contents of the course

• Many important topics in RL will not be covered in depth (e.g., TD). Read more if you want a more comprehensive view of RL.
• The other opportunity to learn what's not covered in lectures is the project, since the potential project topics are much broader than what's covered in class.

My goals

• Encourage you to do research in this area
• Introduce useful mathematical tools to you
• Learn from you

Logistics

• Course website: http://nanjiang.cs.illinois.edu/cs598/
  • logistics, links to slides/notes (uploaded after lectures), resources (e.g., textbooks to consult, related courses), and deadlines & announcements
• Time & location: Tue & Thu 12:30-1:45pm, 1304 Siebel
• TA: Jinglin Chen (jinglinc)
• Office hours: by appointment, 3322 Siebel
  • Questions about the material: ad hoc meetings after lectures, subject to my availability
  • Other things about the class (e.g., the project): by appointment


Introduction to MDPs and RL

Reinforcement Learning (RL): Applications

[Figure: example application domains, with references [Levine et al'16] [Ng et al'03] [Singh et al'02] [Lei et al'12] [Mandel et al'16] [Tesauro et al'07] [Mnih et al'15] [Silver et al'16].]

Shortest Path

[Figure: a directed graph on states s0, b, c, d, e, f, g; each node is a state, each edge is an action with a cost (the edge costs shown are 1, 2, 3, 2, 1, 3, 14, 1).]

Greedy is suboptimal due to delayed effects; we need long-term planning.

Bellman equation: V*(d) = min{3 + V*(g), 2 + V*(f)}

V*(g) = 0, V*(f) = 1, V*(d) = 2, V*(e) = 2, V*(c) = 4, V*(b) = 5, V*(s0) = 6
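To make the Bellman recursion concrete, here is a minimal sketch of computing V* by repeatedly applying the backup V(s) = min over actions of [cost + V(next state)]. The graph below is a hypothetical stand-in (the figure's exact edges are not fully recoverable from the slides); its costs are chosen so the fixed point matches the values above.

    # Minimal sketch: value iteration on a hypothetical deterministic
    # shortest-path graph (illustrative edge costs, not the slide's figure).
    graph = {                      # state -> list of (cost, next_state)
        "s0": [(1, "b"), (2, "c")],
        "b":  [(1, "c")],
        "c":  [(2, "e")],
        "d":  [(3, "g"), (1, "f")],
        "e":  [(1, "d"), (1, "f")],
        "f":  [(1, "g")],
        "g":  [],                  # goal state: V*(g) = 0
    }

    V = {s: 0.0 for s in graph}
    for _ in range(len(graph)):    # |S| in-place sweeps suffice here
        for s, edges in graph.items():
            if edges:              # Bellman backup: V(s) = min_a [cost + V(s')]
                V[s] = min(cost + V[s2] for cost, s2 in edges)

    print(V)  # {'s0': 6.0, 'b': 5.0, 'c': 4.0, 'd': 2.0, 'e': 2.0, 'f': 1.0, 'g': 0.0}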

Stochastic Shortest Path: Markov Decision Process (MDP)

[Figure: the same graph, with one edge cost changed to 4 and some actions made stochastic: one action's transition distribution puts probability 0.7 on one successor state and 0.3 on another; another puts 0.5 on each of two successors. Nodes are states; edges are actions.]

Greedy is still suboptimal due to delayed effects; we need long-term planning.

Bellman equation: V*(c) = min{4 + 0.7 × V*(d) + 0.3 × V*(e), 2 + V*(e)}

V*(g) = 0, V*(f) = 1, V*(d) = 2, V*(e) = 2.5, V*(c) = 4.5, V*(b) = 6, V*(s0) = 6.5
(✓ marks the actions chosen by the optimal policy π*)
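The same value-iteration sketch extends directly: the backup now averages over the transition distribution before taking the min. Below is a tiny, hypothetical fragment of such an MDP (probabilities and costs made up so the fixed point matches the values above).

    # Sketch: value iteration with stochastic transitions,
    #   V(s) = min_a [ cost(s,a) + sum_{s'} P(s'|s,a) * V(s') ].
    mdp = {  # state -> list of (cost, {next_state: probability})
        "c": [(4, {"d": 0.7, "e": 0.3}), (2, {"e": 1.0})],
        "d": [(2.0, {"g": 1.0})],
        "e": [(2.5, {"g": 1.0})],
        "g": [],  # goal
    }

    V = {s: 0.0 for s in mdp}
    for _ in range(100):  # iterate the backup to (near) convergence
        for s, actions in mdp.items():
            if actions:
                V[s] = min(c + sum(p * V[s2] for s2, p in dist.items())
                           for c, dist in actions)

    print(V["c"])  # min{4 + 0.7*2 + 0.3*2.5, 2 + 2.5} = min{6.15, 4.5} = 4.5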

Stochastic Shortest Path via trial-and-error

Now suppose the transition probabilities are unknown to the learner; they must be discovered by acting in the environment and observing what happens:

Trajectory 1: s0 ↘ c ↗ d → g
Trajectory 2: s0 ↘ c ↗ e → f ↗ g
…

Model-based RL: estimate each transition distribution from the observed trajectories (e.g., empirical frequencies 0.72/0.28 and 0.55/0.45 in place of the true 0.7/0.3 and 0.5/0.5), then plan in the estimated model.

Sample complexity: how many trajectories do we need to compute a near-optimal policy?
• Assume states & actions are visited uniformly
• #trajectories needed ≤ n ⋅ (#state-action pairs), where n is the #samples needed to estimate a multinomial distribution
• Without such uniform coverage, some transitions remain unknown ("?" in the figure): the question becomes nontrivial! Need exploration.
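A minimal sketch of the estimation step in model-based RL: count transitions along observed trajectories and normalize into empirical distributions. The trajectories and action names below are hypothetical.

    from collections import Counter, defaultdict

    # Each trajectory is a list of (state, action, next_state) transitions.
    trajectories = [
        [("s0", "down", "c"), ("c", "up", "d"), ("d", "right", "g")],
        [("s0", "down", "c"), ("c", "up", "e"), ("e", "right", "f"), ("f", "up", "g")],
    ]

    counts = defaultdict(Counter)          # (s, a) -> Counter of next states
    for traj in trajectories:
        for s, a, s_next in traj:
            counts[(s, a)][s_next] += 1

    # Empirical transition distributions P_hat(s'|s,a).
    P_hat = {sa: {s2: n / sum(c.values()) for s2, n in c.items()}
             for sa, c in counts.items()}
    print(P_hat[("c", "up")])  # {'d': 0.5, 'e': 0.5} from these two trajectories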

Video game playing

• state st ∈ S
• action at ∈ A
• reward rt = R(st, at)   (e.g., +20)
• transition dynamics P(⋅ | st, at)   (unknown; e.g., random spawn of enemies)
• policy π: S → A
• objective: maximize E[∑_{t=1}^H rt | π]   (or E[∑_{t=1}^∞ γ^{t-1} rt | π])

[Figure: how much each time step's reward counts under the two objectives, comparing discount factors γ = 0.8 vs γ = 0.9 and horizons H = 5 vs H = 10 over time steps 5, 10, 15.]
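The two objectives differ only in how future rewards are aggregated: a hard cutoff at horizon H versus geometric down-weighting by γ. A tiny worked example on a made-up reward sequence:

    # Finite-horizon vs. discounted return for one (hypothetical) episode.
    rewards = [0, 0, 20, 0, 5, 0, 20]
    H, gamma = 5, 0.9

    finite_horizon = sum(rewards[:H])                              # sum_{t=1}^{H} r_t
    discounted = sum(gamma**t * r for t, r in enumerate(rewards))  # gamma^{t-1} r_t, t zero-indexed here
    print(finite_horizon, round(discounted, 2))                    # 25 30.11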

Video game playing (cont.)

There are far too many states to learn a value for each one separately: we need generalization.

Value function approximation: represent the value as a parametric function f(x; θ) of state features x, with parameters θ.

Find θ s.t. f(x; θ) ≈ r + γ ⋅ E_{x'|x}[ f(x'; θ)]  ⇒  f(⋅; θ) ≈ V*
(x: features of the current state; x': features of the next state; r: the observed reward)
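One standard way to search for such a θ is temporal-difference learning: nudge θ to shrink the gap between f(x; θ) and the bootstrapped target r + γ f(x'; θ) on each observed transition. A minimal sketch with a linear f and synthetic placeholder data (an illustration, not the specific method on the slide):

    import numpy as np

    # TD(0) with linear function approximation f(x; theta) = theta @ x.
    rng = np.random.default_rng(0)
    dim, gamma, lr = 4, 0.9, 0.05
    theta = np.zeros(dim)
    # Hypothetical dataset of (x, r, x') transitions; real features would
    # come from the game screen.
    transitions = [(rng.normal(size=dim), rng.normal(), rng.normal(size=dim))
                   for _ in range(1000)]

    for x, r, x_next in transitions:
        td_error = r + gamma * theta @ x_next - theta @ x  # Bellman residual at x
        theta += lr * td_error * x                         # semi-gradient update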

Adaptive medical treatment

• State: diagnosis
• Action: treatment
• Reward: progress in recovery


A Machine Learning view of RL

Many Faces of Reinforcement Learning   (slide credit: David Silver)

[Figure: RL at the intersection of many fields: Computer Science (machine learning), Engineering (optimal control), Mathematics (operations research), Economics (bounded rationality), Psychology (classical/operant conditioning), and Neuroscience (reward system).]

Supervised Learning

Given {(x(i), y(i))}, learn f : x ↦ y
• Online version: for round t = 1, 2, …, the learner
  • observes x(t)
  • predicts ŷ(t)
  • receives y(t)
• Want to maximize the # of correct predictions
• e.g., classify whether an image shows a dog, a cat, a plane, etc. (multi-class classification)
• The dataset is fixed for everyone
• "Full information setting"
• Core challenge: generalization
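A minimal sketch of the online protocol above, with a toy majority-vote learner on made-up data; the point is only the observe/predict/receive loop, and that the true label is always revealed (full information):

    from collections import Counter

    stream = [("furry", "cat"), ("furry", "dog"), ("wings", "plane"),
              ("furry", "dog"), ("wings", "plane")]   # (x(t), y(t)) pairs

    seen = Counter()   # (feature, label) -> count observed so far
    correct = 0
    for x, y in stream:
        votes = {lbl: n for (feat, lbl), n in seen.items() if feat == x}
        y_hat = max(votes, key=votes.get) if votes else "dog"  # predict y_hat(t)
        correct += (y_hat == y)
        seen[(x, y)] += 1   # full information: the true y(t) is revealed
    print(correct, "/", len(stream))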

Contextual bandits

For round t = 1, 2, …, the learner
• given x(t), chooses from a set of actions a(t) ∈ A
• receives reward r(t) ~ R(x(t), a(t)) (i.e., the reward can be random)
• Want to maximize total reward
• You generate your own dataset {(x(t), a(t), r(t))}!
• e.g., for an image, the learner guesses a label and is told whether it is correct (reward = 1 if correct, 0 otherwise); it never observes the true label.
• e.g., for a user, the website recommends a movie and observes whether the user likes it; it never observes which movies the user really wants to see.
• "Partial information setting"
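A minimal sketch of this protocol with an ε-greedy learner on a made-up environment (hypothetical contexts, actions, and rewards); note the learner only ever sees the reward of the action it chose:

    import random
    from collections import defaultdict

    contexts, actions, eps = ["img1", "img2"], ["dog", "cat", "plane"], 0.1
    true_label = {"img1": "dog", "img2": "plane"}
    pull = lambda x, a: 1.0 if a == true_label[x] else 0.0  # unknown R(x, a)

    total = defaultdict(float)   # (x, a) -> cumulative reward
    count = defaultdict(int)     # (x, a) -> #times tried

    for t in range(10_000):
        x = random.choice(contexts)
        if random.random() < eps:    # explore
            a = random.choice(actions)
        else:                        # exploit the best empirical mean so far
            a = max(actions, key=lambda a: total[x, a] / max(count[x, a], 1))
        r = pull(x, a)               # partial information: reward for a only
        total[x, a] += r
        count[x, a] += 1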

Contextual Bandits (cont.)

• Simplification: no x → Multi-Armed Bandits (MAB)
• Bandits is a research area in itself. We will not do a lot of bandits, but may go through some material that has important implications for general RL (e.g., lower bounds).

RL

For round t = 1, 2, …,
• For time step h = 1, 2, …, H, the learner
  • observes xh(t)
  • chooses ah(t)
  • receives rh(t) ~ R(xh(t), ah(t))
  • the next xh+1(t) is generated as a function of xh(t) and ah(t) (or sometimes, of all previous x's and a's within round t)
• Bandits + "delayed rewards/consequences"
• The protocol here is for episodic RL (each t is an episode).
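The episodic protocol written as a loop, with hypothetical env/agent interfaces standing in for the environment and the learner (any concrete algorithm would live inside agent.act and agent.update):

    # Sketch of episodic RL: T rounds (episodes), each an H-step interaction.
    def run(env, agent, T, H):
        for t in range(T):                     # round t is one episode
            x = env.reset()                    # observe x_1
            for h in range(H):
                a = agent.act(x, h)            # choose a_h
                r, x_next = env.step(a)        # receive r_h ~ R(x_h, a_h) and the next state
                agent.update(x, a, r, x_next)  # learn only from the learner's own data
                x = x_next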

Why statistical RL?

Two types of scenarios in RL research:

1. Solving a large planning problem using a learning approach
   - e.g., AlphaGo, video game playing, simulated robotics
   - Transition dynamics (Go rules) known, but too many states
   - Run the simulator to collect data

2. Solving a learning problem
   - e.g., adaptive medical treatment
   - Transition dynamics unknown (and too many states)
   - Interact with the environment to collect data

• I am more interested in #2, which is more challenging in some aspects.

• Data (real-world interactions) is the highest priority; computation comes second.

• Even for #1, sample complexity lower-bounds computational complexity, so a statistical-first approach is reasonable.
  - Caveat to this argument: you can do a lot more in a simulator; see http://hunch.net/?p=8825714

MDP Planning

Infinite-horizon discounted MDPs

An MDP M = (S, A, P, R, γ):

• State space S.

• Action space A.

• Transition function P: S×A→∆(S), where ∆(S) is the probability simplex over S, i.e., all non-negative vectors of length |S| that sum up to 1.

• Reward function R: S×A→ℝ (a deterministic reward function).

• Discount factor γ ∈ [0, 1).

• The agent starts in some state s1, takes action a1, receives reward r1 = R(s1, a1), transitions to s2 ~ P(s1, a1), takes action a2, and so on; the process continues indefinitely.

We will only consider discrete and finite spaces in this course.
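As a concrete picture of the tuple (S, A, P, R, γ), a minimal sketch with numpy arrays; the array names and the random instance are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

# P[s, a] is a point in Delta(S): non-negative, sums to 1 over next states.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)

# R[s, a] is the deterministic reward, here drawn once in [0, 1].
R = rng.random((n_states, n_actions))

def step(s, a):
    """One step of the process: reward R(s, a), then s' ~ P(s, a)."""
    return R[s, a], rng.choice(n_states, p=P[s, a])
```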

Value and policy

• Want to take actions in a way that maximizes value (or return): 𝔼[∑_{t=1}^∞ γ^{t−1} r_t].

• This value depends on where you start and how you act.

• Often assume boundedness of rewards: r_t ∈ [0, Rmax].

• What's the range of 𝔼[∑_{t=1}^∞ γ^{t−1} r_t]? It is [0, Rmax/(1−γ)]; a one-line check follows this list.

• A (deterministic) policy π: S→A describes how the agent acts: at state s_t, it chooses action a_t = π(s_t).

• More generally, the agent may choose actions randomly (π: S→∆(A)), or even in a way that varies across time steps (“non-stationary policies”).

• Define Vπ(s) = 𝔼[∑_{t=1}^∞ γ^{t−1} r_t | s_1 = s, π].
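The one-line check promised above, using only r_t ∈ [0, Rmax] and the geometric series:

\[
0 \;\le\; \mathbb{E}\Big[\sum_{t=1}^{\infty} \gamma^{t-1} r_t\Big]
\;\le\; R_{\max} \sum_{t=1}^{\infty} \gamma^{t-1}
\;=\; \frac{R_{\max}}{1-\gamma}.
\]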

Bellman equation for policy evaluation

Unrolling the sum one step and using the Markov property:

\begin{align*}
V^\pi(s) &= \mathbb{E}\Big[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \;\Big|\; s_1 = s, \pi\Big] \\
&= \mathbb{E}\Big[r_1 + \sum_{t=2}^{\infty} \gamma^{t-1} r_t \;\Big|\; s_1 = s, \pi\Big] \\
&= R(s, \pi(s)) + \sum_{s' \in S} P(s'|s, \pi(s))\, \mathbb{E}\Big[\gamma \sum_{t=2}^{\infty} \gamma^{t-2} r_t \;\Big|\; s_1 = s, s_2 = s', \pi\Big] \\
&= R(s, \pi(s)) + \sum_{s' \in S} P(s'|s, \pi(s))\, \mathbb{E}\Big[\gamma \sum_{t=2}^{\infty} \gamma^{t-2} r_t \;\Big|\; s_2 = s', \pi\Big] \\
&= R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi(s))\, \mathbb{E}\Big[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \;\Big|\; s_1 = s', \pi\Big] \\
&= R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi(s))\, V^\pi(s') \\
&= R(s, \pi(s)) + \gamma \big\langle P(\cdot\,|\,s, \pi(s)),\, V^\pi(\cdot) \big\rangle
\end{align*}
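The last line is also an algorithm: iterating the backup V ← Rπ + γPπV converges to Vπ, because the backup is a γ-contraction in the sup norm. A minimal sketch, assuming Pπ and Rπ are given as numpy arrays (illustrative names, not course code):

```python
import numpy as np

def evaluate_policy(P_pi, R_pi, gamma, tol=1e-10):
    """Fixed-point iteration on V = R_pi + gamma * P_pi @ V.

    P_pi[s, s'] = P(s'|s, pi(s)) (row-stochastic); R_pi[s] = R(s, pi(s)).
    """
    V = np.zeros_like(R_pi, dtype=float)
    while True:
        V_new = R_pi + gamma * P_pi @ V   # one Bellman backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```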

Matrix form: define

• Vπ as the |S|×1 vector [Vπ(s)]_{s∈S}

• Rπ as the |S|×1 vector [R(s, π(s))]_{s∈S}

• Pπ as the |S|×|S| matrix [P(s′|s, π(s))]_{s∈S, s′∈S}

Then the Bellman equation Vπ(s) = R(s, π(s)) + γ⟨P(·|s, π(s)), Vπ(·)⟩ becomes

Vπ = Rπ + γPπVπ

(I − γPπ)Vπ = Rπ

Vπ = (I − γPπ)^{−1}Rπ

The matrix (I − γPπ) is always invertible. Proof? (A sketch follows.)
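One standard answer, sketched here since the slide leaves it as a question: Pπ is row-stochastic, so ‖γPπ‖∞ = γ < 1 and the Neumann series converges:

\[
(I - \gamma P^{\pi}) \sum_{t=0}^{T} (\gamma P^{\pi})^{t}
\;=\; I - (\gamma P^{\pi})^{T+1} \;\longrightarrow\; I
\quad \text{as } T \to \infty,
\]

because \(\|(\gamma P^{\pi})^{T+1}\|_\infty \le \gamma^{T+1} \to 0\). Hence \((I - \gamma P^{\pi})^{-1} = \sum_{t=0}^{\infty} (\gamma P^{\pi})^{t}\) exists.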

State occupancy

Consider (I − γPπ)^{−1}. Each row (indexed by s) is the discounted state occupancy dπs, whose (s′)-th entry is

dπs(s′) = 𝔼[∑_{t=1}^∞ γ^{t−1} 𝕀[s_t = s′] | s_1 = s, π]

• Each row is like a distribution vector, except that the entries sum up to 1/(1−γ). Let ηπs = (1 − γ)dπs denote the normalized vector.

• Vπ(s) is the dot product between dπs and the reward vector Rπ.

• dπs(s′) can also be interpreted as the value function of π at s under the indicator reward function 𝕀[· = s′].
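A quick numerical sanity check of these facts, on an illustrative random instance (not course code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 4, 0.9
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)      # row-stochastic P^pi
R_pi = rng.random(n)

D = np.linalg.inv(np.eye(n) - gamma * P_pi)  # row s is the occupancy d_s^pi
V = D @ R_pi                                 # V^pi(s) = <d_s^pi, R^pi>

assert np.allclose(D.sum(axis=1), 1 / (1 - gamma))  # rows sum to 1/(1-gamma)
eta = (1 - gamma) * D                               # normalized occupancies
assert np.allclose(eta.sum(axis=1), 1.0)
```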

Optimality

• For infinite-horizon discounted MDPs, there always exists a stationary and deterministic policy that is optimal for all starting states simultaneously.
  - Proof: Puterman ’94, Thm 6.2.7 (reference due to Shipra Agrawal).

• Let π* denote this optimal policy, and V* := Vπ*.

• Bellman Optimality Equation:

V*(s) = max_{a∈A} ( R(s, a) + γ 𝔼_{s′∼P(s,a)}[V*(s′)] )

• If we know V*, how do we get π*?

• It is easier to work with Q-values Q*(s, a), as

π*(s) = argmax_{a∈A} Q*(s, a), where Q*(s, a) = R(s, a) + γ 𝔼_{s′∼P(s,a)}[max_{a′∈A} Q*(s′, a′)].
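The Q-form of the optimality equation suggests the obvious algorithm: iterate the optimality backup and act greedily at the end. A minimal value-iteration sketch under the same illustrative array conventions as before (P of shape |S|×|A|×|S|, R of shape |S|×|A|); this is one standard way to compute Q*, not something prescribed by the slides.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10):
    """Iterate Q <- R + gamma * E_{s'~P}[max_a' Q(s', a')] to a fixed point."""
    Q = np.zeros_like(R, dtype=float)
    while True:
        V = Q.max(axis=1)            # V*(s) = max_a Q*(s, a)
        Q_new = R + gamma * (P @ V)  # P @ V has shape |S| x |A|
        if np.max(np.abs(Q_new - Q)) < tol:
            break
        Q = Q_new
    return Q, Q.argmax(axis=1)       # Q* and the greedy policy pi*
```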

Ad Hoc Homework 1

• uploaded on the course website

• helps you understand the relationships between alternative MDP formulations

• more like readings with questions to think about

• no need to submit

• we will go through it in class next week