CS 598 Statistical Reinforcement Learning
Nan Jiang
Overview
What’s this course about?
• A grad-level seminar course on the theory of RL
  • with focus on sample complexity analyses
  • all about proofs, some perspectives, 0 implementation
• A seminar course can be anywhere between students presenting papers and a rigorous course
  • In this course I will deliver most (or all) of the lectures
  • No textbook; material is created by myself (course notes)
  • will share slides/notes on course website
Who should take this course?
• This course will be a good fit for you if you…
  • are interested in understanding RL mathematically
  • want to do research in theory of RL or a related area
  • want to work with me
  • are comfortable with maths
• This course will not be a good fit if you…
  • are mostly interested in implementing RL algorithms and making things work; we won’t cover any engineering tricks (which are essential to, e.g., Deep RL); check other RL courses offered across the campus
Prerequisites
• Maths
  • Linear algebra, probability & statistics, basic calculus
  • Markov chains
  • Optional: stochastic processes, numerical analysis
  • Useful for research: TCS background, empirical processes and statistical learning theory, optimization, online learning
• Exposure to ML
  • e.g., CS 446 Machine Learning
• Experience with RL
Coursework
• Some readings after/before class
• Ad hoc homework to help digest particular material. Deadlines will be lenient & TBA at the time of assignment
• Main assignment: course project (work on your own)
  • Baseline: reproduce theoretical analysis in existing papers
  • Advanced: identify an interesting/challenging extension to the paper and explore the novel research question yourself
  • Or, just work on a novel research question (must have a significant theoretical component; need to discuss with me)
Course project (cont.)
• See list of references and potential topics on website
• You will need to submit:
  • A brief proposal (~1/2 page). Tentative deadline: end of Feb
    • what’s the topic and what papers you plan to work on
    • why you chose the topic: what interests you?
    • which aspect(s) will you focus on?
  • Final report: clarity, precision, and brevity are greatly valued. More details to come…
• All docs should be in PDF. Final report should be prepared using LaTeX.
Course project (cont. 2)
Rule of thumb:
1. learn something that interests you
2. teach me something! (I wouldn’t learn if I could not understand your report due to lack of clarity)
3. write a report similar to (or better than!) the notes I will share with you
Contents of the course
• Many important topics in RL will not be covered in depth (e.g., TD). Read more if you want to get a more comprehensive view of RL
• The other opportunity to learn what’s not covered in lectures is the project, as potential topics for projects are much broader than what’s covered in class.
My goals
• Encourage you to do research in this area
• Introduce useful mathematical tools to you
• Learn from you
Logistics
• Course website: http://nanjiang.cs.illinois.edu/cs598/
  • logistics, links to slides/notes (uploaded after lectures), and resources (e.g., textbooks to consult, related courses), deadlines & announcements
• Time & Location: Tue & Thu 12:30-1:45pm, 1304 Siebel.
• TA: Jinglin Chen (jinglinc)
• Office hours: By appointment. 3322 Siebel.
• Questions about material: ad hoc meetings after lectures, subject to my availability
• Other things about the class (e.g., project): by appointment
Introduction to MDPs and RL

Reinforcement Learning (RL): Applications
[Figure: images of example application domains, with citations: Levine et al.’16, Ng et al.’03, Singh et al.’02, Lei et al.’12, Mandel et al.’16, Tesauro et al.’07, Mnih et al.’15, Silver et al.’16]
Shortest Path
[Figure: a directed graph with states s0, b, c, d, e, f, g (goal g); each arrow is an action with an edge cost]
• Greedy is suboptimal due to delayed effects; need long-term planning
• Bellman Equation:
  V*(g) = 0
  V*(f) = 1
  V*(d) = min{3 + V*(g), 2 + V*(f)} = 2
  V*(e) = 2, V*(c) = 4, V*(b) = 5, V*(s0) = 6
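The Bellman equations above can be solved by repeated Bellman backups. Here is a minimal sketch in Python; the edge costs below are my own reconstruction, chosen so that the computed values reproduce the V* numbers on the slide (the actual figure may differ):

```python
# Bellman backups on a small deterministic shortest-path graph.
# Edge costs are an assumed reconstruction, not taken from the figure.
costs = {
    ("s0", "b"): 1, ("s0", "c"): 2,
    ("b", "d"): 3,
    ("c", "d"): 4, ("c", "e"): 2,
    ("d", "g"): 3, ("d", "f"): 1,
    ("e", "f"): 1,
    ("f", "g"): 1,
}
states = ["s0", "b", "c", "d", "e", "f", "g"]

V = {s: float("inf") for s in states}
V["g"] = 0.0  # goal state
# Bellman optimality update: V(s) = min over edges of cost + V(next state).
# Sweeping |states| times suffices for convergence on this acyclic graph.
for _ in range(len(states)):
    for (s, s2), c in costs.items():
        V[s] = min(V[s], c + V[s2])

print(V["s0"])  # → 6.0
```

Running it gives V matching the slide: V(f) = 1, V(d) = 2, V(c) = 4, V(b) = 5, V(s0) = 6.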
Stochastic Shortest Path
[Figure: the same graph, now with stochastic transitions: at c one action leads to d with probability 0.7 and to e with probability 0.3; at e one action has a 0.5/0.5 transition distribution. This is a Markov Decision Process (MDP).]
• Greedy is suboptimal due to delayed effects; need long-term planning
• Bellman Equation:
  V*(c) = min{4 + 0.7 × V*(d) + 0.3 × V*(e), 2 + V*(e)}
  V*(g) = 0, V*(f) = 1, V*(d) = 2, V*(e) = 2.5, V*(c) = 4.5, V*(b) = 6, V*(s0) = 6.5
• ✓ marks in the figure indicate the optimal policy π*
Stochastic Shortest Path via trial-and-error
[Figure: the same MDP, with sampled trajectories highlighted]
• Trajectory 1: s0 ↘ c ↗ d → g
• Trajectory 2: s0 ↘ c ↗ e → f ↗ g
Stochastic Shortest Path via trial-and-error: Model-based RL
[Figure: the same graph, with transition probabilities replaced by empirical estimates from the collected trajectories, e.g., 0.72/0.28 and 0.55/0.45 in place of the true 0.7/0.3 and 0.5/0.5]
• Estimate the transition probabilities from trajectory data, then plan in the estimated model
• How many trajectories do we need to compute a near-optimal policy? (sample complexity)
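The estimation step of model-based RL can be sketched as simple counting. The state names, true probabilities, and sample size below are illustrative choices of mine, mimicking the figure’s empirical estimates (e.g., 0.72/0.28 vs. the true 0.7/0.3):

```python
import random
from collections import Counter

# Estimate a transition distribution P(. | c, a) from sampled transitions.
random.seed(0)
samples = random.choices(["d", "e"], weights=[0.7, 0.3], k=50)

counts = Counter(samples)
p_hat = {s2: counts[s2] / len(samples) for s2 in ("d", "e")}
print(p_hat)  # empirical estimate, close (but not equal) to {d: 0.7, e: 0.3}
```

With more samples the estimate concentrates around the true distribution, which is exactly where the sample complexity question below comes from.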
Stochastic Shortest Path via trial-and-error
How many trajectories do we need to compute a near-optimal policy?
• Assume states & actions are visited uniformly
• #trajectories needed ≤ n ⋅ (#state-action pairs), where n = #samples needed to estimate a multinomial distribution
[Figure: the same graph, with “?” marking transition distributions that have not been estimated yet]
• Nontrivial! Need exploration
Video game playing
[Figure: a game screen annotated with the elements of the RL loop]
• state st ∈ S
• action at ∈ A
• reward rt = R(st, at) (e.g., +20)
• transition dynamics P(⋅ | st, at) (unknown; e.g., random spawn of enemies)
• policy π: S → A
• objective: maximize 𝔼[∑_{t=1}^H rt | π] (or 𝔼[∑_{t=1}^∞ γ^{t−1} rt | π])
[Plot: the discount weights γ^{t−1} for γ = 0.8 vs. γ = 0.9, and truncation horizons H = 5 vs. H = 10]
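To see how γ shapes the discounted objective, here is a small sketch computing the discounted return of a made-up constant reward stream (the rewards are illustrative, not from the slide):

```python
# Discounted return: sum_t gamma^(t-1) * r_t for a fixed reward sequence.
def discounted_return(rewards, gamma):
    return sum(gamma ** (t - 1) * r for t, r in enumerate(rewards, start=1))

rewards = [1.0] * 100  # a constant reward of 1 per step
print(discounted_return(rewards, 0.8))  # close to 1/(1-0.8) = 5
print(discounted_return(rewards, 0.9))  # close to 1/(1-0.9) = 10
```

The totals 5 and 10 reflect the “effective horizons” 1/(1−γ) that the plot compares against the hard truncations H = 5 and H = 10.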
Video game playing
• Need generalization: the state space is far too large to learn about each state separately
• Value function approximation: f(x; θ), where x = state features and θ = parameters
[Figure: a parameterized function (e.g., a neural net) mapping state features x to a value f(x; θ)]
• Find θ s.t. f(x; θ) ≈ r + γ ⋅ 𝔼_{x′|x}[f(x′; θ)]  ⇒ f(⋅; θ) ≈ V*
[Plot: learned value as a function of state features x]
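The fixed-point condition above can be turned into the classic TD(0) update: nudge θ so that f(x; θ) moves toward r + γ f(x′; θ). A minimal sketch; the 10-state chain, one-hot features, and step size are my own illustrative choices (with one-hot features this reduces to tabular TD; richer features are what give generalization):

```python
import random

# TD(0) with a linear value-function approximator f(x; theta) = theta . x.
random.seed(0)
gamma, alpha, n = 0.9, 0.1, 10
theta = [0.0] * n

def features(s):                 # one-hot feature vector for state s
    x = [0.0] * n
    x[s] = 1.0
    return x

def f(x, theta):                 # linear approximator f(x; theta)
    return sum(w * xi for w, xi in zip(theta, x))

for _ in range(5000):
    s = random.randrange(n)                          # sample a state uniformly
    s_next = s + 1                                   # "move right" dynamics
    r = 1.0 if s_next == n else 0.0                  # reward upon reaching the end
    target = r + (0.0 if s_next == n else gamma * f(features(s_next), theta))
    td_error = target - f(features(s), theta)        # temporal-difference error
    theta = [w + alpha * td_error * xi for w, xi in zip(theta, features(s))]

print(round(f(features(n - 1), theta), 2))  # ≈ 1.0 (one step before the reward)
```

After training, f(⋅; θ) approximates V of the “always move right” behavior: ≈1 one step from the reward, ≈0.9 two steps away, and so on.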
Adaptive medical treatment
• State: diagnosis
• Action: treatment
• Reward: progress in recovery
A Machine Learning view of RL

Many Faces of Reinforcement Learning
[Diagram: RL at the intersection of many fields: Machine Learning (Computer Science), Optimal Control (Engineering), the Reward System (Neuroscience), Classical/Operant Conditioning (Psychology), Operations Research (Mathematics), and Bounded Rationality (Economics)]
slide credit: David Silver
Supervised Learning
Given {(x(i), y(i))}, learn f : x ↦ y
• Online version: for round t = 1, 2, …, the learner
  • observes x(t)
  • predicts ŷ(t)
  • receives y(t)
• Want to maximize # of correct predictions
• e.g., classify whether an image shows a dog, a cat, a plane, etc. (multi-class classification)
• Dataset is fixed for everyone
• “Full information setting”
• Core challenge: generalization
Contextual Bandits
For round t = 1, 2, …, the learner
• Given x(t), chooses an action a(t) from a set A
• Receives reward r(t) ~ R(x(t), a(t)) (i.e., rewards can be random)
• Want to maximize total reward
• You generate your own dataset {(x(t), a(t), r(t))}!
• e.g., for an image, the learner guesses a label and is told whether it is correct (reward = 1 if correct, 0 otherwise); it never learns the true label
• e.g., for a user, the website recommends a movie and observes whether the user likes it; it never learns which movies the user really wants to see
• “Partial information setting”
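The protocol above can be sketched as a simple epsilon-greedy learner. The two contexts, three actions, and reward probabilities below are made up for illustration:

```python
import random

# Epsilon-greedy contextual bandit: learn per-(context, action) mean rewards
# from bandit feedback only (we never see rewards of unchosen actions).
random.seed(0)
contexts, actions, eps = ["day", "night"], [0, 1, 2], 0.1
true_mean = {("day", 0): 0.8, ("day", 1): 0.2, ("day", 2): 0.5,
             ("night", 0): 0.1, ("night", 1): 0.9, ("night", 2): 0.4}
counts = {k: 0 for k in true_mean}
means = {k: 0.0 for k in true_mean}   # running empirical mean reward

total = 0.0
for t in range(5000):
    x = random.choice(contexts)                        # observe context x(t)
    if random.random() < eps:                          # explore
        a = random.choice(actions)
    else:                                              # exploit current estimates
        a = max(actions, key=lambda act: means[(x, act)])
    r = 1.0 if random.random() < true_mean[(x, a)] else 0.0  # partial feedback
    counts[(x, a)] += 1
    means[(x, a)] += (r - means[(x, a)]) / counts[(x, a)]
    total += r

print(total / 5000)  # average reward, approaching the optimal ~0.85
```

Because feedback is only received for the chosen action, the learner must explore (the eps fraction of rounds) to discover that, e.g., action 1 is best at night.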
Contextual Bandits (cont.)
• Simplification: no x; this is the Multi-Armed Bandits (MAB) setting
• Bandits is a research area by itself. We will not do a lot of bandits, but may go through some material that has important implications for general RL (e.g., lower bounds)
RL
For round t = 1, 2, …,
• For time step h = 1, 2, …, H, the learner
  • Observes xh(t)
  • Chooses ah(t)
  • Receives rh(t) ~ R(xh(t), ah(t))
  • The next xh+1(t) is generated as a function of xh(t) and ah(t) (or sometimes, of all previous x’s and a’s within round t)
• Bandits + “delayed rewards/consequences”
• The protocol here is for episodic RL (each t is an episode).
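The episodic protocol above, written out as code. The chain environment, reward, and fixed policy are made up for illustration:

```python
# Episodic RL protocol: each round t is an episode of H steps.
H, n_states = 5, 6

def step(x, a):                        # environment dynamics (made up)
    x_next = min(max(x + a, 0), n_states - 1)
    r = 1.0 if x_next == n_states - 1 else 0.0   # reward at the right end
    return x_next, r

def policy(x):                         # a fixed policy: always move right
    return +1

returns = []
for t in range(100):                   # rounds = episodes
    x, ep_return = 0, 0.0
    for h in range(1, H + 1):          # time steps h = 1, ..., H
        a = policy(x)                  # choose a_h based on x_h
        x, r = step(x, a)              # receive r_h, then observe x_{h+1}
        ep_return += r
    returns.append(ep_return)

print(sum(returns) / len(returns))  # → 1.0: the policy reaches the end every episode
```

In a learning algorithm, the policy would be updated between episodes; here it is held fixed to keep the protocol itself in focus.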
Why statistical RL?
32
Two types of scenarios in RL research
Why statistical RL?
32
Two types of scenarios in RL research
1. Solving a large planning problem using a learning approach
Why statistical RL?
32
Two types of scenarios in RL research
1. Solving a large planning problem using a learning approach- e.g., AlphaGo, video game playing, simulated robotics
Why statistical RL?
32
Two types of scenarios in RL research
1. Solving a large planning problem using a learning approach- e.g., AlphaGo, video game playing, simulated robotics- Transition dynamics (Go rules) known, but too many states
Why statistical RL?
32
Two types of scenarios in RL research
1. Solving a large planning problem using a learning approach- e.g., AlphaGo, video game playing, simulated robotics- Transition dynamics (Go rules) known, but too many states- Run the simulator to collect data
Why statistical RL?
32
Two types of scenarios in RL research
1. Solving a large planning problem using a learning approach- e.g., AlphaGo, video game playing, simulated robotics- Transition dynamics (Go rules) known, but too many states- Run the simulator to collect data
2. Solving a learning problem
Why statistical RL?
32
Two types of scenarios in RL research
1. Solving a large planning problem using a learning approach- e.g., AlphaGo, video game playing, simulated robotics- Transition dynamics (Go rules) known, but too many states- Run the simulator to collect data
2. Solving a learning problem- e.g., adaptive medical treatment
Why statistical RL?
32
Two types of scenarios in RL research
1. Solving a large planning problem using a learning approach- e.g., AlphaGo, video game playing, simulated robotics- Transition dynamics (Go rules) known, but too many states- Run the simulator to collect data
2. Solving a learning problem- e.g., adaptive medical treatment- Transition dynamics unknown (and too many states)
Why statistical RL?
32
Two types of scenarios in RL research
1. Solving a large planning problem using a learning approach- e.g., AlphaGo, video game playing, simulated robotics- Transition dynamics (Go rules) known, but too many states- Run the simulator to collect data
2. Solving a learning problem- e.g., adaptive medical treatment- Transition dynamics unknown (and too many states)- Interact with the environment to collect data
Why statistical RL?
Why statistical RL?
Two types of scenarios in RL research
1. Solving a large planning problem using a learning approach
2. Solving a learning problem
• I am more interested in #2, which is more challenging in some aspects.
• Data (real-world interactions) is the highest priority; computation comes second.
• Even for #1, sample complexity lower-bounds computational complexity, so a statistics-first approach is reasonable.
  • caveat to this argument: you can do a lot more in a simulator; see http://hunch.net/?p=8825714
MDP Planning
Infinite-horizon discounted MDPs

An MDP M = (S, A, P, R, γ)

• State space S.
• Action space A.
• Transition function P: S×A→∆(S), where ∆(S) is the probability simplex over S, i.e., all non-negative vectors of length |S| that sum up to 1.
• Reward function R: S×A→ℝ (a deterministic reward function).
• Discount factor γ ∈ [0, 1).
• The agent starts in some state s_1, takes action a_1, receives reward r_1 = R(s_1, a_1), transitions to s_2 ~ P(s_1, a_1), takes action a_2, and so on; the process continues indefinitely.

We will only consider discrete and finite spaces in this course.

35
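Although this course involves no implementation, the tuple M = (S, A, P, R, γ) above can be written down concretely as arrays. The two-state, two-action MDP below is a made-up toy example (not from the course), just to make the objects tangible:

```python
import numpy as np

S, A = 2, 2                      # |S| states, |A| actions
gamma = 0.9                      # discount factor in [0, 1)

# P[s, a] is a distribution over next states: a point in the simplex Delta(S)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])

# Deterministic reward function R: S x A -> R
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

# Sanity check: each P[s, a] is non-negative and sums to 1
assert (P >= 0).all() and np.allclose(P.sum(axis=-1), 1.0)
```

Storing P as an |S|×|A|×|S| tensor makes the simplex constraint easy to check row by row.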
Value and policy

• Want to take actions in a way that maximizes value (or return): 𝔼[∑_{t=1}^∞ γ^{t−1} r_t]
• This value depends on where you start and how you act.
• Often assume boundedness of rewards: r_t ∈ [0, Rmax].
• What’s the range of 𝔼[∑_{t=1}^∞ γ^{t−1} r_t]? It is [0, Rmax/(1−γ)].
• A (deterministic) policy π: S→A describes how the agent acts: at state s_t, it chooses action a_t = π(s_t).
• More generally, the agent may choose actions randomly (π: S→∆(A)), or even in a way that varies across time steps (“non-stationary policies”).
• Define V^π(s) = 𝔼[∑_{t=1}^∞ γ^{t−1} r_t | s_1 = s, π].

36
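The definition of V^π(s) can be checked numerically by Monte Carlo: roll out trajectories under π and average the discounted returns. This is a rough sketch on a made-up two-state MDP (not from the course); trajectories are truncated at horizon H, which introduces error at most γ^H · Rmax/(1−γ):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, Rmax = 0.9, 2.0
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = [0, 1]                      # a deterministic policy pi: S -> A

def mc_value(s, n_traj=2000, H=200):
    """Monte Carlo estimate of V^pi(s) from truncated rollouts."""
    total = 0.0
    for _ in range(n_traj):
        state, ret, disc = s, 0.0, 1.0
        for _ in range(H):
            a = pi[state]
            ret += disc * R[state, a]     # accumulate gamma^{t-1} r_t
            disc *= gamma
            state = rng.choice(2, p=P[state, a])
        total += ret
    return total / n_traj

v0 = mc_value(0)
# Any return must lie in [0, Rmax / (1 - gamma)] since r_t is in [0, Rmax]
assert 0.0 <= v0 <= Rmax / (1 - gamma)
```

The final assertion is exactly the range claim from the slide: with r_t ∈ [0, Rmax], the return lies in [0, Rmax/(1−γ)].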
Bellman equation for policy evaluation

V^π(s) = 𝔼[∑_{t=1}^∞ γ^{t−1} r_t | s_1 = s, π]
       = 𝔼[r_1 + ∑_{t=2}^∞ γ^{t−1} r_t | s_1 = s, π]
       = R(s, π(s)) + ∑_{s′∈S} P(s′|s, π(s)) 𝔼[γ ∑_{t=2}^∞ γ^{t−2} r_t | s_1 = s, s_2 = s′, π]
       = R(s, π(s)) + ∑_{s′∈S} P(s′|s, π(s)) 𝔼[γ ∑_{t=2}^∞ γ^{t−2} r_t | s_2 = s′, π]    (Markov property)
       = R(s, π(s)) + γ ∑_{s′∈S} P(s′|s, π(s)) 𝔼[∑_{t=1}^∞ γ^{t−1} r_t | s_1 = s′, π]    (shift time indices)
       = R(s, π(s)) + γ ∑_{s′∈S} P(s′|s, π(s)) V^π(s′)
       = R(s, π(s)) + γ ⟨P(·|s, π(s)), V^π(·)⟩

37
Bellman equation for policy evaluation

Matrix form: define
• V^π as the |S|×1 vector [V^π(s)]_{s∈S}
• R^π as the |S|×1 vector [R(s, π(s))]_{s∈S}
• P^π as the |S|×|S| matrix [P(s′|s, π(s))]_{s∈S, s′∈S}

Then V^π(s) = R(s, π(s)) + γ⟨P(·|s, π(s)), V^π(·)⟩ becomes

V^π = R^π + γ P^π V^π
(I − γP^π) V^π = R^π
V^π = (I − γP^π)^{−1} R^π

(I − γP^π) is always invertible. Proof?

38
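The matrix form gives exact policy evaluation by solving one linear system. A minimal sketch on a made-up two-state MDP (the policy `pi` and all numbers are illustrative assumptions), using a linear solve rather than forming the inverse explicitly:

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = [0, 1]                      # a deterministic policy pi: S -> A

S = P.shape[0]
# Row s of P_pi is P(.|s, pi(s)); entry s of R_pi is R(s, pi(s))
P_pi = np.stack([P[s, pi[s]] for s in range(S)])   # |S| x |S|
R_pi = np.array([R[s, pi[s]] for s in range(S)])   # |S| vector

# Solve (I - gamma P^pi) V^pi = R^pi
V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

# The solution satisfies the Bellman equation V^pi = R^pi + gamma P^pi V^pi
assert np.allclose(V_pi, R_pi + gamma * P_pi @ V_pi)
```

`np.linalg.solve` is preferred over `np.linalg.inv` here for numerical stability, though both are valid since (I − γP^π) is invertible.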
State occupancy

Each row (indexed by s) of (I − γP^π)^{−1} is the discounted state occupancy d^π_s, whose s′-th entry is

d^π_s(s′) = 𝔼[∑_{t=1}^∞ γ^{t−1} 𝕀[s_t = s′] | s_1 = s, π]

• Each row is like a distribution vector, except that the entries sum up to 1/(1−γ). Let η^π_s = (1 − γ) d^π_s denote the normalized vector.
• V^π(s) is the dot product between d^π_s and the reward vector R^π.
• d^π_s(s′) can also be interpreted as the value function at s under the indicator reward function 𝕀[· = s′].

39
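Both occupancy properties above (rows summing to 1/(1−γ), and V^π as a dot product with rewards) can be verified numerically. A sketch on a made-up two-state chain under some fixed policy (P_pi and R_pi are illustrative assumptions):

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.0, 1.0]])        # transition matrix under some policy pi
R_pi = np.array([1.0, 2.0])          # reward vector under that policy

S = P_pi.shape[0]
D = np.linalg.inv(np.eye(S) - gamma * P_pi)   # row s is the occupancy d^pi_s

# Each row sums to 1/(1 - gamma); normalizing gives a distribution eta^pi_s
assert np.allclose(D.sum(axis=1), 1.0 / (1.0 - gamma))
eta = (1.0 - gamma) * D
assert np.allclose(eta.sum(axis=1), 1.0)

# V^pi(s) is the dot product of d^pi_s with the reward vector
V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
assert np.allclose(D @ R_pi, V_pi)
```

The row-sum identity follows from the Neumann series (I − γP^π)^{−1} = ∑_{t≥0} (γP^π)^t, since each (P^π)^t is row-stochastic.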
Optimality

• For infinite-horizon discounted MDPs, there always exists a stationary and deterministic policy that is optimal for all starting states simultaneously.
  • Proof: Puterman ’94, Thm 6.2.7 (reference due to Shipra Agrawal)
• Let π* denote this optimal policy, and V* := V^{π*}.
• Bellman Optimality Equation:
  V*(s) = max_{a∈A} ( R(s, a) + γ 𝔼_{s′∼P(s,a)}[V*(s′)] )
• If we know V*, how do we get π*?
• It is easier to work with Q-values Q*(s, a), which satisfy
  Q*(s, a) = R(s, a) + γ 𝔼_{s′∼P(s,a)}[max_{a′∈A} Q*(s′, a′)]
  so that π*(s) = argmax_{a∈A} Q*(s, a).

40
Ad Hoc Homework 1

• uploaded on course website
• helps understand the relationships between alternative MDP formulations
• more like readings with questions to think about
• no need to submit
• will go through it in class next week

41
Recommended