Safe Reinforcement Learning
Philip S. Thomas
Stanford CS234: Reinforcement Learning, Guest Lecture
May 24, 2017
Lecture overview
• What makes a reinforcement learning algorithm safe?
• Notation
• Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
• Empirical results
• Research directions
What does it mean for a reinforcement learning algorithm to be safe?
Changing the objective
[Figure: two policies compared on a diagram of per-step rewards (-50, +0, +20); Policy 1 versus Policy 2.]
Changing the objective
• Policy 1:
  • Reward = 0 with probability 0.999999
  • Reward = $10^9$ with probability 1 − 0.999999
  • Expected reward approximately 1,000
• Policy 2:
  • Reward = 999 with probability 0.5
  • Reward = 1,000 with probability 0.5
  • Expected reward 999.5
Another notion of safety
Another notion of safety (Munos et al.)
Another notion of safety
The Problem
• If you apply an existing method, do you have confidence that it will work?
Reinforcement learning successes
A property of many real applications
• Deploying “bad” policies can be costly or dangerous.
Deploying bad policies can be costly
Deploying bad policies can be dangerous
What property should a safe algorithm have?
• Guaranteed to work on the first try
  • "I guarantee that with probability at least $1 - \delta$, I will not change your policy to one that is worse than the current policy."
• You get to choose 𝛿
• This guarantee is not contingent on the tuning of any hyperparameters
Lecture overview
• What makes a reinforcement learning algorithm safe?
• Notation
• Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
• Empirical results
• Research directions
Notation
• Policy, $\pi$: $\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$
• History: $H = (s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_L, a_L, r_L)$
• Historical data: $D = \{H_1, H_2, \ldots, H_n\}$
  • Historical data from behavior policy, $\pi_b$
• Objective: $J(\pi) = \mathbf{E}\left[\sum_{t=1}^{L}\gamma^t R_t \,\middle|\, \pi\right]$
[Figure: agent-environment loop; the agent sends action $a$ to the environment and receives state $s$ and reward $r$.]
Safe reinforcement learning algorithm
• Reinforcement learning algorithm, 𝑎
• Historical data, 𝐷, which is a random variable
• Policy produced by the algorithm, 𝑎(𝐷), which is a random variable
• A safe reinforcement learning algorithm, $a$, satisfies:
  $\Pr\big(J(a(D)) \ge J(\pi_b)\big) \ge 1 - \delta$
  or, in general: $\Pr\big(J(a(D)) \ge J_{\min}\big) \ge 1 - \delta$
Lecture overview
• What makes a reinforcement learning algorithm safe?
• Notation
• Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
• Empirical results
• Research directions
Creating a safe reinforcement learning algorithm
• Off-policy policy evaluation (OPE)
  • For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $J(\pi_e)$
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the $n$ independent and unbiased estimates of $J(\pi_e)$ into a $1 - \delta$ confidence lower bound on $J(\pi_e)$
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, $a$
Off-policy policy evaluation (OPE)
[Diagram: Historical Data, $D$, and a Proposed Policy, $\pi_e$, are fed to an OPE method, which outputs an estimate of $J(\pi_e)$.]
Importance Sampling (Intuition)
[Figure: the probability of each history under the evaluation policy, $\pi_e$, versus under the behavior policy, $\pi_b$.]
• Monte Carlo estimate (histories sampled from $\pi_e$): $\hat{J}(\pi_e) = \frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{L}\gamma^t R_t^i$
• Importance sampling estimate (histories sampled from $\pi_b$): $\hat{J}(\pi_e) = \frac{1}{n}\sum_{i=1}^{n} w_i \sum_{t=1}^{L}\gamma^t R_t^i$
Math Slide 2/3
$w_i = \frac{\Pr(H_i \mid \pi_e)}{\Pr(H_i \mid \pi_b)}$
• Reminder:
  • History, $H = (s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_L, a_L, r_L)$
  • Objective, $J(\pi_e) = \mathbf{E}\left[\sum_{t=1}^{L}\gamma^t R_t \,\middle|\, \pi_e\right]$
• The unknown transition and reward terms cancel in the ratio, so:
$w_i = \frac{\Pr(H_i \mid \pi_e)}{\Pr(H_i \mid \pi_b)} = \prod_{t=1}^{L}\frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}$
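As a concrete reference, here is a minimal Python sketch of the ordinary importance sampling estimator defined above. The trajectory format, the `pi_e`/`pi_b` callables, and the function name are illustrative assumptions, not code from the lecture.

```python
import numpy as np

def importance_sampling(trajectories, pi_e, pi_b, gamma=1.0):
    """Ordinary importance sampling estimate of J(pi_e) from histories
    generated by pi_b.  Each trajectory is a list of (s, a, r) tuples;
    pi_e(a, s) and pi_b(a, s) return action probabilities (assumed format)."""
    estimates = []
    for traj in trajectories:
        w = 1.0    # w_i = prod_t pi_e(a_t | s_t) / pi_b(a_t | s_t)
        ret = 0.0  # discounted return of this history
        for t, (s, a, r) in enumerate(traj, start=1):
            w *= pi_e(a, s) / pi_b(a, s)
            ret += (gamma ** t) * r   # slides index time from t = 1
        estimates.append(w * ret)     # one unbiased estimate of J(pi_e)
    return float(np.mean(estimates))  # average of n independent estimates
```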
Importance sampling (History)
• Kahn, H., Marshall, A. W. (1953). Methods of reducing sample size in Monte Carlo computations. Journal of the Operations Research Society of America, 1(5):263–278.
  • Let $X = 0$ with probability $1 - 10^{-10}$ and $X = 10^{10}$ with probability $10^{-10}$
  • $\mathbf{E}[X] = 1$
  • A Monte Carlo estimate from $n \ll 10^{10}$ samples of $X$ is almost always zero
  • Idea: sample $X$ from some other distribution and use importance sampling to "correct" the estimate (see the numerical sketch below)
  • Can produce lower-variance estimates
• Josiah Hanna et al., ICML 2017 (to appear).
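A quick numerical illustration of the Kahn and Marshall example above: naive Monte Carlo almost never sees the rare event, while sampling it from a different distribution and reweighting recovers $\mathbf{E}[X] \approx 1$. The sampling distribution $q$ (rare event with probability 0.5) is my own choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
p_rare, value = 1e-10, 1e10          # X = 1e10 w.p. 1e-10, else 0; E[X] = 1

# Naive Monte Carlo: with n << 1e10 samples, the estimate is almost surely 0.
x = np.where(rng.random(n) < p_rare, value, 0.0)
print("Monte Carlo:", x.mean())

# Importance sampling: sample the rare event from q (probability 0.5),
# then reweight each sample by p(x) / q(x).
rare = rng.random(n) < 0.5
x_q = np.where(rare, value, 0.0)
w = np.where(rare, p_rare / 0.5, (1.0 - p_rare) / 0.5)
print("Importance sampling:", (w * x_q).mean())   # close to 1.0
```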
Importance sampling (History, continued)
• Precup, D., Sutton, R. S., Singh, S. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann
Importance sampling (Proof)
• Estimate $\mathbf{E}_p[f(X)]$ given a sample of $X \sim q$
• Let $P = \mathrm{supp}(p)$, $Q = \mathrm{supp}(q)$, and $F = \mathrm{supp}(f)$
• Importance sampling estimate: $\frac{p(X)}{q(X)} f(X)$
$\mathbf{E}_q\!\left[\frac{p(X)}{q(X)}f(X)\right] = \sum_{x\in Q} q(x)\,\frac{p(x)}{q(x)}\,f(x) = \sum_{x\in P\cap Q} p(x)\,f(x) = \sum_{x\in P} p(x)\,f(x) \;-\; \sum_{x\in P\cap \bar{Q}} p(x)\,f(x)$
Importance sampling (Proof)
• Assume $P \subseteq Q$ (the assumption can be relaxed to $P \subseteq Q \cup \bar{F}$, so that $p(x)f(x) = 0$ wherever $q(x) = 0$)
• Importance sampling is then an unbiased estimator of $\mathbf{E}_p[f(X)]$:
$\mathbf{E}_q\!\left[\frac{p(X)}{q(X)}f(X)\right] = \sum_{x\in P} p(x)\,f(x) - \sum_{x\in P\cap\bar{Q}} p(x)\,f(x) = \sum_{x\in P} p(x)\,f(x) = \mathbf{E}_p[f(X)]$
(the middle sum vanishes because $P \cap \bar{Q} = \emptyset$ when $P \subseteq Q$)
Importance sampling (Proof)
• Assume $f(x) \ge 0$ for all $x$ (and drop the assumption that $P \subseteq Q$)
• Importance sampling is then a negative-bias estimator of $\mathbf{E}_p[f(X)]$:
$\mathbf{E}_q\!\left[\frac{p(X)}{q(X)}f(X)\right] = \sum_{x\in P} p(x)\,f(x) - \sum_{x\in P\cap\bar{Q}} p(x)\,f(x) \le \sum_{x\in P} p(x)\,f(x) = \mathbf{E}_p[f(X)]$
Importance sampling (reminder)
$\mathrm{IS}(D) = \frac{1}{n}\sum_{i=1}^{n}\left(\prod_{t=1}^{L}\frac{\pi_e(a_t^i \mid s_t^i)}{\pi_b(a_t^i \mid s_t^i)}\right)\left(\sum_{t=1}^{L}\gamma^t R_t^i\right)$
$\mathbf{E}[\mathrm{IS}(D)] = J(\pi_e)$
Creating a safe reinforcement learning algorithm
• Off-policy policy evaluation (OPE)
  • For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $J(\pi_e)$
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the $n$ independent and unbiased estimates of $J(\pi_e)$ into a $1 - \delta$ confidence lower bound on $J(\pi_e)$
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, $a$
High confidence off-policy policy evaluation (HCOPE)
[Diagram: Historical Data, $D$, and a Proposed Policy, $\pi_e$, are fed to an HCOPE method, which outputs a $1 - \delta$ confidence lower bound on $J(\pi_e)$.]
Hoeffding's inequality
• Let $X_1, \ldots, X_n$ be $n$ independent, identically distributed random variables such that $X_i \in [0, b]$
• Then with probability at least $1 - \delta$:
$\mathbf{E}[X_i] \ge \frac{1}{n}\sum_{i=1}^{n} X_i - b\sqrt{\frac{\ln(1/\delta)}{2n}}$
Math Slide 3/3
• Apply Hoeffding's inequality with each $X_i$ set to the importance-weighted return of the $i$th history, so that the sample mean appearing in the bound is the importance sampling estimate:
$\frac{1}{n}\sum_{i=1}^{n} w_i \sum_{t=1}^{L}\gamma^t R_t^i$
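Putting OPE and the concentration inequality together, a minimal sketch of the Hoeffding-based lower bound is below. The inputs (a vector of importance-weighted returns and an upper bound $b$ on their range) and the function name are assumptions for illustration.

```python
import numpy as np

def hoeffding_lower_bound(weighted_returns, b, delta=0.05):
    """1 - delta confidence lower bound on E[X_i] for i.i.d. X_i in [0, b].
    Here each X_i is one importance-weighted return, w_i * sum_t gamma^t R_t^i,
    and b is an upper bound on how large that quantity can be."""
    x = np.asarray(weighted_returns, dtype=float)
    n = len(x)
    return x.mean() - b * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
```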
Creating a safe reinforcement learning algorithm
• Off-policy policy evaluation (OPE)
  • For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $J(\pi_e)$
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the $n$ independent and unbiased estimates of $J(\pi_e)$ into a $1 - \delta$ confidence lower bound on $J(\pi_e)$
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, $a$
Safe policy improvement (SPI)
[Diagram: the SPI algorithm takes Historical Data, $D$, and a probability $1 - \delta$, and outputs either a new policy, $\pi$, or "No Solution Found".]
[Diagram: the historical data is split into a training set (20%), used to select a candidate policy, $\pi$, and a testing set (80%), used by the safety test.]
Safe policy improvement (SPI)
Is the $1 - \delta$ confidence lower bound on $J(\pi)$ larger than $J(\pi_{\mathrm{cur}})$?
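A rough skeleton of the SPI loop sketched above, assuming some HCOPE routine `lower_bound` (for example, the Hoeffding-based one) and some search procedure `candidate_search`; both, along with the data-split details and names, are placeholders rather than the lecture's actual implementation.

```python
import numpy as np

def safe_policy_improvement(data, j_cur, candidate_search, lower_bound,
                            delta=0.05, train_frac=0.2, seed=0):
    """Select a candidate policy on a small training split, then accept it
    only if the 1 - delta lower bound on its performance (computed on the
    held-out testing split) exceeds the current policy's performance."""
    idx = np.random.default_rng(seed).permutation(len(data))
    n_train = int(train_frac * len(data))
    train = [data[i] for i in idx[:n_train]]
    test = [data[i] for i in idx[n_train:]]

    pi_candidate = candidate_search(train)   # e.g., maximize a predicted lower bound
    if lower_bound(test, pi_candidate, delta) > j_cur:   # the safety test
        return pi_candidate
    return "No Solution Found"
```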
Creating a safe reinforcement learning algorithm
• Off-policy policy evaluation (OPE)
  • For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $J(\pi_e)$
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the $n$ independent and unbiased estimates of $J(\pi_e)$ into a $1 - \delta$ confidence lower bound on $J(\pi_e)$
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, $a$
WON’T WORK
Off-policy policy evaluation (revisited)
• Importance sampling (IS):
$\mathrm{IS}(D) = \frac{1}{n}\sum_{i=1}^{n}\left(\prod_{t=1}^{L}\frac{\pi_e(a_t^i \mid s_t^i)}{\pi_b(a_t^i \mid s_t^i)}\right)\left(\sum_{t=1}^{L}\gamma^t R_t^i\right)$
• Per-decision importance sampling (PDIS):
$\mathrm{PDIS}(D) = \frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{L}\gamma^t\left(\prod_{\tau=1}^{t}\frac{\pi_e(a_\tau^i \mid s_\tau^i)}{\pi_b(a_\tau^i \mid s_\tau^i)}\right) R_t^i$
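A sketch of PDIS in the same style as the IS sketch earlier: each reward is weighted only by the importance ratios of the actions taken up to that time step. The trajectory format and names are, again, illustrative assumptions.

```python
import numpy as np

def per_decision_importance_sampling(trajectories, pi_e, pi_b, gamma=1.0):
    """Per-decision importance sampling estimate of J(pi_e)."""
    estimates = []
    for traj in trajectories:
        w = 1.0       # running product of ratios up to the current time step
        total = 0.0
        for t, (s, a, r) in enumerate(traj, start=1):
            w *= pi_e(a, s) / pi_b(a, s)
            total += (gamma ** t) * w * r   # R_t weighted by ratios up to time t only
        estimates.append(total)
    return float(np.mean(estimates))
```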
Off-policy policy evaluation (revisited)
• Importance sampling (IS):
$\mathrm{IS}(D) = \frac{1}{n}\sum_{i=1}^{n} w_i \sum_{t=1}^{L}\gamma^t R_t^i$
• Weighted importance sampling (WIS):
$\mathrm{WIS}(D) = \frac{1}{\sum_{i=1}^{n} w_i}\sum_{i=1}^{n} w_i \sum_{t=1}^{L}\gamma^t R_t^i$
Off-policy policy evaluation (revisited)
• Weighted importance sampling (WIS), sketched in code below:
$\mathrm{WIS}(D) = \frac{1}{\sum_{i=1}^{n} w_i}\sum_{i=1}^{n} w_i \sum_{t=1}^{L}\gamma^t R_t^i$
• Not unbiased. When $n = 1$, $\mathbf{E}[\mathrm{WIS}] = J(\pi_b)$
• Strongly consistent estimator of $J(\pi_e)$
  • i.e., $\Pr\left(\lim_{n\to\infty}\mathrm{WIS}(D) = J(\pi_e)\right) = 1$
  • If:
    • Finite horizon
    • One behavior policy, or bounded rewards
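Here is the WIS sketch referenced above, in the same assumed trajectory format as the earlier examples; the normalization by the summed weights (rather than by $n$) is the only change from ordinary IS.

```python
import numpy as np

def weighted_importance_sampling(trajectories, pi_e, pi_b, gamma=1.0):
    """WIS: normalize by the sum of importance weights instead of by n.
    Biased (equals a single behavior-policy return when n = 1) but strongly
    consistent, and typically far lower variance than ordinary IS."""
    weights, returns = [], []
    for traj in trajectories:
        w, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj, start=1):
            w *= pi_e(a, s) / pi_b(a, s)
            ret += (gamma ** t) * r
        weights.append(w)
        returns.append(ret)
    weights = np.asarray(weights)
    return float(np.dot(weights, returns) / weights.sum())
```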
Off-policy policy evaluation (revisited)
• Weighted per-decision importance sampling
  • Also called consistent weighted per-decision importance sampling (CWPDIS)
• A fun exercise!
Control variates
• Given: $X$
• Estimate: $\mu = \mathbf{E}[X]$
• Estimator: $\hat\mu = X$
• Unbiased: $\mathbf{E}[\hat\mu] = \mathbf{E}[X] = \mu$
• Variance: $\mathrm{Var}(\hat\mu) = \mathrm{Var}(X)$
Control variates
• Given: $X$, $Y$, $\mathbf{E}[Y]$
• Estimate: $\mu = \mathbf{E}[X]$
• Estimator: $\hat\mu = X - Y + \mathbf{E}[Y]$
• Unbiased: $\mathbf{E}[\hat\mu] = \mathbf{E}[X - Y] + \mathbf{E}[Y] = \mathbf{E}[X] - \mathbf{E}[Y] + \mathbf{E}[Y] = \mathbf{E}[X] = \mu$
• Variance: $\mathrm{Var}(\hat\mu) = \mathrm{Var}(X - Y + \mathbf{E}[Y]) = \mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y)$
  • Lower variance than $\hat\mu = X$ if $2\,\mathrm{Cov}(X, Y) > \mathrm{Var}(Y)$
• We call $Y$ a control variate (numerical illustration below).
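A tiny numerical illustration of the variance reduction described above; the particular distributions of $X$ and $Y$ are made up purely for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Toy setup: Y is correlated with X and has a known mean (E[Y] = 0),
# so it can serve as a control variate for estimating mu = E[X] = 3.
y = rng.normal(0.0, 1.0, n)
x = 3.0 + y + rng.normal(0.0, 0.5, n)   # Var(X) = 1.25, Cov(X, Y) = Var(Y) = 1

plain = x.mean()                 # mu_hat = X (averaged over samples)
with_cv = (x - y + 0.0).mean()   # mu_hat = X - Y + E[Y]
print(plain, with_cv)            # both near 3; the CV version has lower variance
                                 # because 2 Cov(X, Y) = 2 > Var(Y) = 1
```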
Off-policy policy evaluation (revisited)
• Idea: add a control variate to importance sampling estimators
  • $X$ is the importance sampling estimator
  • $Y$ is a control variate built from an approximate model of the MDP
  • $\mathbf{E}[Y] = 0$ in this case
• $\mathrm{PDIS}^{\mathrm{CV}}(D) = \mathrm{PDIS}(D) - \mathrm{CV}(D)$
• Called the doubly robust estimator (Jiang and Li, 2015)
  • Robust to (1) a poor approximate model, and (2) error in estimates of $\pi_b$
    • If the model is poor, the estimates are still unbiased
    • If the sampling policy is unknown, but the model is good, the MSE will still be low
  • $\mathrm{DR}(D) = \mathrm{PDIS}^{\mathrm{CV}}(D)$
• Non-recursive and weighted forms, as well as the control variate view, are provided by Thomas and Brunskill (2016)
Off-policy policy evaluation (revisited)
$\mathrm{DR}(\pi_e \mid \mathcal{D}) = \frac{1}{n}\sum_{i=1}^{n}\sum_{t=0}^{\infty}\gamma^t\left(w_t^i\left(R_t^i - \hat q^{\pi_e}(S_t^i, A_t^i)\right) + w_{t-1}^i\,\hat v^{\pi_e}(S_t^i)\right)$
• Recall: we want the control variate, $Y$, to cancel with $X$: $R - \hat q(S, A) + \gamma \hat v(S')$
• Per-decision importance sampling (PDIS) weights: $w_t^i = \prod_{\tau=0}^{t}\frac{\pi_e(A_\tau^i \mid S_\tau^i)}{\pi_b(A_\tau^i \mid S_\tau^i)}$, with $w_{-1}^i = 1$
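A rough sketch of the doubly robust / PDIS-with-control-variate estimator following the formula above. The approximate model enters through `q_hat(s, a)` and `v_hat(s)` (where `v_hat` should be the expectation of `q_hat` under $\pi_e$); these, the trajectory format, and the names are assumptions for illustration, not the lecture's code.

```python
import numpy as np

def doubly_robust(trajectories, pi_e, pi_b, q_hat, v_hat, gamma=1.0):
    """DR estimate of J(pi_e).  q_hat and v_hat come from an approximate
    model and act only as a control variate: a poor model affects the
    variance, not the expected value (given pi_b is known).
    For exact cancellation, v_hat(s) should equal sum_a pi_e(a|s) * q_hat(s, a)."""
    estimates = []
    for traj in trajectories:
        w_prev = 1.0   # w_{t-1}, with w_{-1} = 1
        w = 1.0
        total = 0.0
        for t, (s, a, r) in enumerate(traj):   # time indexed from t = 0 here
            w *= pi_e(a, s) / pi_b(a, s)       # w_t
            total += (gamma ** t) * (w * (r - q_hat(s, a)) + w_prev * v_hat(s))
            w_prev = w
        estimates.append(total)
    return float(np.mean(estimates))
```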
Empirical Results (Gridworld)
[Plot: mean squared error (log scale, roughly 0.001 to 10,000) versus number of episodes, $n$ (2 to 2,000), comparing importance sampling (IS) with the approximate model (AM).]
• Approximate model (AM): also called the direct method (Dudík, 2011) and the indirect method (Sutton and Barto, 1998)
Empirical Results (Gridworld)
[Plot: the same gridworld experiment, now also showing per-decision importance sampling (PDIS) alongside IS and AM.]
Empirical Results (Gridworld)
[Plot: the same experiment, adding the doubly robust (DR) estimator.]
Empirical Results (Gridworld)
[Plot: the same experiment, adding WIS and CWPDIS.]
Empirical Results (Gridworld)
[Plot: the same experiment, adding WDR.]
Off-policy policy evaluation (revisited)
• What if $\mathrm{supp}(\pi_e) \subset \mathrm{supp}(\pi_b)$?
  • There is a state-action pair, $(s, a)$, such that $\pi_e(a \mid s) = 0$, but $\pi_b(a \mid s) \ne 0$
  • If we see a history where $(s, a)$ occurs, what weight should we give it?
$\mathrm{IS}(D) = \frac{1}{n}\sum_{i=1}^{n}\left(\prod_{t=1}^{L}\frac{\pi_e(a_t^i \mid s_t^i)}{\pi_b(a_t^i \mid s_t^i)}\right)\left(\sum_{t=1}^{L}\gamma^t R_t^i\right)$
Off-policy policy evaluation (revisited)
• What if there are zero samples ($n = 0$)?
  • The importance sampling estimate is undefined
• What if no samples are in $\mathrm{supp}(\pi_e)$ (or $\mathrm{supp}(p)$ in general)?
  • Importance sampling says: the estimate is zero
  • Alternate approach: undefined
• The importance sampling estimator is unbiased if $n > 0$
• The alternate approach is unbiased given that at least one sample is in the support of $p$
• The alternate approach is detailed in Importance Sampling with Unequal Support (Thomas and Brunskill, AAAI 2017)
Off-policy policy evaluation (revisited)
• Thomas et al. Predictive Off-Policy Policy Evaluation for Nonstationary Decision Problems, with Applications to Digital Marketing (AAAI 2017)
Off-policy policy evaluation (revisited)
• Thomas and Brunskill. Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning (ICML 2016)
[Plot: the gridworld mean squared error results, now including MAGIC alongside IS, PDIS, WIS, CWPDIS, DR, AM, and WDR.]
[Plot: mean squared error versus number of episodes, $n$ (1 to 10,000), comparing IS, DR, AM, WDR, and MAGIC.]
Creating a safe reinforcement learning algorithm
• Off-policy policy evaluation (OPE)
  • For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $J(\pi_e)$
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the $n$ independent and unbiased estimates of $J(\pi_e)$ into a $1 - \delta$ confidence lower bound on $J(\pi_e)$
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, $a$
High-confidence off-policy policy evaluation (revisited)
• Consider using IS + Hoeffding's inequality for HCOPE on mountain car
Natural Temporal Difference Learning, Dabney and Thomas, 2014
High-confidence off-policy policy evaluation (revisited)
• Using 100,000 trajectories
• The evaluation policy's true performance is $0.19 \in [0, 1]$
• We get a 95% confidence lower bound of: -5,831,000
What went wrong?
$w_i = \prod_{t=1}^{L}\frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}$
What went wrong?
$\mathbf{E}[X_i] \ge \frac{1}{n}\sum_{i=1}^{n} X_i - b\sqrt{\frac{\ln(1/\delta)}{2n}}$
• $b \approx 10^{9.4}$
• Largest observed importance-weighted return: 316.
High-confidence off-policy policy evaluation (revisited)
• Removing the upper tail only decreases the expected value.
High-confidence off-policy policy evaluation (revisited)
• Thomas et al., High Confidence Off-Policy Evaluation, AAAI 2015
High-confidence off-policy policy evaluation (revisited)
High-confidence off-policy policy evaluation (revisited)
• Use 20% of the data to optimize $c$.
• Use 80% to compute the lower bound with the optimized $c$.
• Mountain car results, 95% confidence lower bound on the mean:
  • CUT: 0.145
  • Chernoff-Hoeffding: -5,831,000
  • Maurer: -129,703
  • Anderson: 0.055
  • Bubeck et al.: -0.046
High-confidence off-policy policy evaluation (revisited)
• Digital Marketing:
High-confidence off-policy policy evaluation (revisited)
• Cognitive dissonance:
$\mathbf{E}[X_i] \ge \frac{1}{n}\sum_{i=1}^{n} X_i - b\sqrt{\frac{\ln(1/\delta)}{2n}}$
High-confidence off-policy policy evaluation (revisited)
• Student's $t$-test (sketch below)
  • Assumes that $\mathrm{IS}(D)$ is normally distributed
  • By the central limit theorem, it is (as $n \to \infty$)
• Efron's bootstrap methods (e.g., BCa)
• Also, without importance sampling: Hanna, Stone, and Niekum, AAMAS 2017
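For comparison with the exact concentration-inequality bounds above, here is a minimal sketch of the Student's $t$-based lower bound referenced in the list; it relies on approximate normality of the importance-weighted returns, so its guarantee is only asymptotic. Names and inputs are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def t_test_lower_bound(weighted_returns, delta=0.05):
    """Approximate 1 - delta lower bound on the mean via a one-sided
    Student's t interval (semi-safe: exact only as n grows large)."""
    x = np.asarray(weighted_returns, dtype=float)
    n = len(x)
    return x.mean() - stats.t.ppf(1.0 - delta, df=n - 1) * x.std(ddof=1) / np.sqrt(n)
```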
High-confidence off-policy policy evaluation (revisited)
P. S. Thomas. Safe reinforcement learning (PhD Thesis, 2015)
Creating a safe reinforcement learning algorithm
• Off-policy policy evaluation (OPE)
  • For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $J(\pi_e)$
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the $n$ independent and unbiased estimates of $J(\pi_e)$ into a $1 - \delta$ confidence lower bound on $J(\pi_e)$
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, $a$
[Diagram: the historical data is split into a training set (20%), used to select a candidate policy, $\pi$, and a testing set (80%), used by the safety test.]
Safe policy improvement (revisited)
• Is the $1 - \delta$ confidence lower bound on $J(\pi)$ larger than $J(\pi_{\mathrm{cur}})$?
• Thomas et al., ICML 2015
Lecture overview
• What makes a reinforcement learning algorithm safe?
• Notation
• Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
• Empirical results
• Research directions
Empirical Results
• What to look for:
  • Data efficiency
  • Error rates
Empirical Results: Mountain Car
Empirical Results: Digital Marketing
[Figure: agent-environment loop, as before.]
Empirical Results: Digital Marketing
[Bar chart: expected normalized return (values around 0.0027 to 0.0038) for $n$ = 10,000, 30,000, 60,000, and 100,000 episodes, comparing the variants None/CUT, None/BCa, k-Fold/CUT, and k-Fold/BCa.]
Empirical Results: Digital Marketing
Example Results: Diabetes Treatment
[Diagram: blood glucose (sugar) rises when carbohydrates are eaten and falls when insulin is released.]
Example Results: Diabetes Treatment
[Diagram: the same blood glucose loop, now marking hyperglycemia (high blood sugar).]
Example Results: Diabetes Treatment
[Diagram: the same blood glucose loop, marking both hyperglycemia (high blood sugar) and hypoglycemia (low blood sugar).]
Example Results: Diabetes Treatment
$\text{injection} = \frac{\text{blood glucose} - \text{target blood glucose}}{CF} + \frac{\text{meal size}}{CR}$
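The bolus calculation above is simple enough to state directly in code; as I understand the example, the treatment policy is parameterized by the correction factor (CF) and carbohydrate ratio (CR), which are what the safe RL algorithm would adjust. The function below is just a restatement of the formula, with hypothetical names.

```python
def bolus_injection(blood_glucose, target_blood_glucose, meal_size, cf, cr):
    """Insulin bolus: a correction term scaled by the correction factor (CF)
    plus a meal term scaled by the carbohydrate ratio (CR)."""
    return (blood_glucose - target_blood_glucose) / cf + meal_size / cr
```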
Example Results: Diabetes Treatment
• Intelligent Diabetes Management
Example Results: Diabetes Treatment
[Plots: probability that the policy was changed, and probability that the policy was made worse.]
Future Directions
• How to deal with long horizons?
• How to deal with importance sampling being “unfair”?
• What to do when the behavior policy is not known?
• What to do when the behavior policy is deterministic?
Summary
• Safe reinforcement learning:
  • Risk-sensitive
  • Learning from demonstration
  • Asymptotic convergence even if data is off-policy
  • Guaranteed (with probability $1 - \delta$) not to make the policy worse
• Designing a safe reinforcement learning algorithm:
  • Off-policy policy evaluation (OPE)
    • IS, PDIS, WIS, WPDIS, DR, WDR, US, TSP, MAGIC
  • High confidence off-policy policy evaluation (HCOPE)
    • Hoeffding, CUT inequality, Student's $t$-test, BCa
  • Safe policy improvement (SPI)
    • Selecting the candidate policy
Takeaway
• Safe reinforcement learning is tractable!
  • Not just polynomial amounts of data, but practical amounts of data