Safe Reinforcement Learning
Philip S. Thomas
Stanford CS234: Reinforcement Learning, Guest Lecture
May 24, 2017
Lecture overview
• What makes a reinforcement learning algorithm safe?
• Notation
• Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
• Empirical results
• Research directions
What does it mean for a reinforcement learning algorithm to be safe?
Changing the objective
[Figure: two policies compared on a diagram of per-step rewards (-50, +0, +20); Policy 1 versus Policy 2.]
Changing the objective
• Policy 1:
  • Reward = 0 with probability 0.999999
  • Reward = $10^9$ with probability 1 − 0.999999
  • Expected reward approximately 1,000
• Policy 2:
  • Reward = 999 with probability 0.5
  • Reward = 1,000 with probability 0.5
  • Expected reward 999.5
Another notion of safety
Another notion of safety (Munos et al.)
Another notion of safety
The Problem
• If you apply an existing method, do you have confidence that it will work?
Reinforcement learning successes
A property of many real applications
• Deploying “bad” policies can be costly or dangerous.
Deploying bad policies can be costly
Deploying bad policies can be dangerous
What property should a safe algorithm have?
• Guaranteed to work on the first try
  • "I guarantee that with probability at least $1 - \delta$, I will not change your policy to one that is worse than the current policy."
• You get to choose 𝛿
• This guarantee is not contingent on the tuning of any hyperparameters
Lecture overview
• What makes a reinforcement learning algorithm safe?
• Notation
• Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
• Empirical results
• Research directions
Notation
• Policy, $\pi$: $\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$
• History: $H = (s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_L, a_L, r_L)$
• Historical data: $D = \{H_1, H_2, \ldots, H_n\}$
  • Historical data from behavior policy, $\pi_b$
• Objective: $J(\pi) = \mathbf{E}\left[\sum_{t=1}^{L}\gamma^t R_t \,\middle|\, \pi\right]$
[Figure: agent-environment loop; the agent sends action $a$ to the environment and receives state $s$ and reward $r$.]
Safe reinforcement learning algorithm
• Reinforcement learning algorithm, 𝑎
• Historical data, 𝐷, which is a random variable
• Policy produced by the algorithm, 𝑎(𝐷), which is a random variable
• A safe reinforcement learning algorithm, $a$, satisfies:
  $\Pr\big(J(a(D)) \ge J(\pi_b)\big) \ge 1 - \delta$
  or, in general: $\Pr\big(J(a(D)) \ge J_{\min}\big) \ge 1 - \delta$
Lecture overview
• What makes a reinforcement learning algorithm safe?
• Notation
• Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
• Empirical results
• Research directions
Creating a safe reinforcement learning algorithm
• Off-policy policy evaluation (OPE)
  • For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $J(\pi_e)$
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the $n$ independent and unbiased estimates of $J(\pi_e)$ into a $1 - \delta$ confidence lower bound on $J(\pi_e)$
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, $a$
Off-policy policy evaluation (OPE)
[Diagram: Historical Data, $D$, and a Proposed Policy, $\pi_e$, are fed to an OPE method, which outputs an estimate of $J(\pi_e)$.]
Importance Sampling (Intuition)
[Figure: the probability of each history under the evaluation policy, $\pi_e$, versus under the behavior policy, $\pi_b$.]
• Monte Carlo estimate (histories sampled from $\pi_e$): $\hat{J}(\pi_e) = \frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{L}\gamma^t R_t^i$
• Importance sampling estimate (histories sampled from $\pi_b$): $\hat{J}(\pi_e) = \frac{1}{n}\sum_{i=1}^{n} w_i \sum_{t=1}^{L}\gamma^t R_t^i$
Math Slide 2/3
$w_i = \frac{\Pr(H_i \mid \pi_e)}{\Pr(H_i \mid \pi_b)}$
• Reminder:
  • History, $H = (s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_L, a_L, r_L)$
  • Objective, $J(\pi_e) = \mathbf{E}\left[\sum_{t=1}^{L}\gamma^t R_t \,\middle|\, \pi_e\right]$
• The unknown transition and reward terms cancel in the ratio, so:
$w_i = \frac{\Pr(H_i \mid \pi_e)}{\Pr(H_i \mid \pi_b)} = \prod_{t=1}^{L}\frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}$
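As a concrete reference, here is a minimal Python sketch of the ordinary importance sampling estimator defined above. The trajectory format, the `pi_e`/`pi_b` callables, and the function name are illustrative assumptions, not code from the lecture.

```python
import numpy as np

def importance_sampling(trajectories, pi_e, pi_b, gamma=1.0):
    """Ordinary importance sampling estimate of J(pi_e) from histories
    generated by pi_b.  Each trajectory is a list of (s, a, r) tuples;
    pi_e(a, s) and pi_b(a, s) return action probabilities (assumed format)."""
    estimates = []
    for traj in trajectories:
        w = 1.0    # w_i = prod_t pi_e(a_t | s_t) / pi_b(a_t | s_t)
        ret = 0.0  # discounted return of this history
        for t, (s, a, r) in enumerate(traj, start=1):
            w *= pi_e(a, s) / pi_b(a, s)
            ret += (gamma ** t) * r   # slides index time from t = 1
        estimates.append(w * ret)     # one unbiased estimate of J(pi_e)
    return float(np.mean(estimates))  # average of n independent estimates
```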
Importance sampling (History)
• Kahn, H., Marshall, A. W. (1953). Methods of reducing sample size in Monte Carlo computations. Journal of the Operations Research Society of America, 1(5):263–278.
  • Let $X = 0$ with probability $1 - 10^{-10}$ and $X = 10^{10}$ with probability $10^{-10}$
  • $\mathbf{E}[X] = 1$
  • A Monte Carlo estimate from $n \ll 10^{10}$ samples of $X$ is almost always zero
  • Idea: sample $X$ from some other distribution and use importance sampling to "correct" the estimate (see the numerical sketch below)
  • Can produce lower-variance estimates
• Josiah Hanna et al., ICML 2017 (to appear).
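A quick numerical illustration of the Kahn and Marshall example above: naive Monte Carlo almost never sees the rare event, while sampling it from a different distribution and reweighting recovers $\mathbf{E}[X] \approx 1$. The sampling distribution $q$ (rare event with probability 0.5) is my own choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
p_rare, value = 1e-10, 1e10          # X = 1e10 w.p. 1e-10, else 0; E[X] = 1

# Naive Monte Carlo: with n << 1e10 samples, the estimate is almost surely 0.
x = np.where(rng.random(n) < p_rare, value, 0.0)
print("Monte Carlo:", x.mean())

# Importance sampling: sample the rare event from q (probability 0.5),
# then reweight each sample by p(x) / q(x).
rare = rng.random(n) < 0.5
x_q = np.where(rare, value, 0.0)
w = np.where(rare, p_rare / 0.5, (1.0 - p_rare) / 0.5)
print("Importance sampling:", (w * x_q).mean())   # close to 1.0
```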
Importance sampling (History, continued)
• Precup, D., Sutton, R. S., Singh, S. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann
Importance sampling (Proof)
• Estimate $\mathbf{E}_p[f(X)]$ given a sample of $X \sim q$
• Let $P = \mathrm{supp}(p)$, $Q = \mathrm{supp}(q)$, and $F = \mathrm{supp}(f)$
• Importance sampling estimate: $\frac{p(X)}{q(X)} f(X)$
$\mathbf{E}_q\!\left[\frac{p(X)}{q(X)}f(X)\right] = \sum_{x\in Q} q(x)\,\frac{p(x)}{q(x)}\,f(x) = \sum_{x\in P\cap Q} p(x)\,f(x) = \sum_{x\in P} p(x)\,f(x) \;-\; \sum_{x\in P\cap \bar{Q}} p(x)\,f(x)$
Importance sampling (Proof)
• Assume $P \subseteq Q$ (the assumption can be relaxed to $P \subseteq Q \cup \bar{F}$, so that $p(x)f(x) = 0$ wherever $q(x) = 0$)
• Importance sampling is then an unbiased estimator of $\mathbf{E}_p[f(X)]$:
$\mathbf{E}_q\!\left[\frac{p(X)}{q(X)}f(X)\right] = \sum_{x\in P} p(x)\,f(x) - \sum_{x\in P\cap\bar{Q}} p(x)\,f(x) = \sum_{x\in P} p(x)\,f(x) = \mathbf{E}_p[f(X)]$
(the middle sum vanishes because $P \cap \bar{Q} = \emptyset$ when $P \subseteq Q$)
Importance sampling (Proof)
• Assume $f(x) \ge 0$ for all $x$ (and drop the assumption that $P \subseteq Q$)
• Importance sampling is then a negative-bias estimator of $\mathbf{E}_p[f(X)]$:
$\mathbf{E}_q\!\left[\frac{p(X)}{q(X)}f(X)\right] = \sum_{x\in P} p(x)\,f(x) - \sum_{x\in P\cap\bar{Q}} p(x)\,f(x) \le \sum_{x\in P} p(x)\,f(x) = \mathbf{E}_p[f(X)]$
Importance sampling (reminder)
$\mathrm{IS}(D) = \frac{1}{n}\sum_{i=1}^{n}\left(\prod_{t=1}^{L}\frac{\pi_e(a_t^i \mid s_t^i)}{\pi_b(a_t^i \mid s_t^i)}\right)\left(\sum_{t=1}^{L}\gamma^t R_t^i\right)$
$\mathbf{E}[\mathrm{IS}(D)] = J(\pi_e)$
Creating a safe reinforcement learning algorithm
• Off-policy policy evaluation (OPE)
  • For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $J(\pi_e)$
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the $n$ independent and unbiased estimates of $J(\pi_e)$ into a $1 - \delta$ confidence lower bound on $J(\pi_e)$
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, $a$
High confidence off-policy policy evaluation (HCOPE)
[Diagram: Historical Data, $D$, and a Proposed Policy, $\pi_e$, are fed to an HCOPE method, which outputs a $1 - \delta$ confidence lower bound on $J(\pi_e)$.]
Hoeffding's inequality
• Let $X_1, \ldots, X_n$ be $n$ independent, identically distributed random variables such that $X_i \in [0, b]$
• Then with probability at least $1 - \delta$:
$\mathbf{E}[X_i] \ge \frac{1}{n}\sum_{i=1}^{n} X_i - b\sqrt{\frac{\ln(1/\delta)}{2n}}$
Math Slide 3/3
• Apply Hoeffding's inequality with each $X_i$ set to the importance-weighted return of the $i$th history, so that the sample mean appearing in the bound is the importance sampling estimate:
$\frac{1}{n}\sum_{i=1}^{n} w_i \sum_{t=1}^{L}\gamma^t R_t^i$
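Putting OPE and the concentration inequality together, a minimal sketch of the Hoeffding-based lower bound is below. The inputs (a vector of importance-weighted returns and an upper bound $b$ on their range) and the function name are assumptions for illustration.

```python
import numpy as np

def hoeffding_lower_bound(weighted_returns, b, delta=0.05):
    """1 - delta confidence lower bound on E[X_i] for i.i.d. X_i in [0, b].
    Here each X_i is one importance-weighted return, w_i * sum_t gamma^t R_t^i,
    and b is an upper bound on how large that quantity can be."""
    x = np.asarray(weighted_returns, dtype=float)
    n = len(x)
    return x.mean() - b * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
```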
Creating a safe reinforcement learning algorithm
• Off-policy policy evaluation (OPE)
  • For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $J(\pi_e)$
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the $n$ independent and unbiased estimates of $J(\pi_e)$ into a $1 - \delta$ confidence lower bound on $J(\pi_e)$
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, $a$
Safe policy improvement (SPI)
[Diagram: the SPI algorithm takes Historical Data, $D$, and a probability $1 - \delta$, and outputs either a new policy, $\pi$, or "No Solution Found".]
[Diagram: the historical data is split into a training set (20%), used to select a candidate policy, $\pi$, and a testing set (80%), used by the safety test.]
Safe policy improvement (SPI)
Is the $1 - \delta$ confidence lower bound on $J(\pi)$ larger than $J(\pi_{\mathrm{cur}})$?
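A rough skeleton of the SPI loop sketched above, assuming some HCOPE routine `lower_bound` (for example, the Hoeffding-based one) and some search procedure `candidate_search`; both, along with the data-split details and names, are placeholders rather than the lecture's actual implementation.

```python
import numpy as np

def safe_policy_improvement(data, j_cur, candidate_search, lower_bound,
                            delta=0.05, train_frac=0.2, seed=0):
    """Select a candidate policy on a small training split, then accept it
    only if the 1 - delta lower bound on its performance (computed on the
    held-out testing split) exceeds the current policy's performance."""
    idx = np.random.default_rng(seed).permutation(len(data))
    n_train = int(train_frac * len(data))
    train = [data[i] for i in idx[:n_train]]
    test = [data[i] for i in idx[n_train:]]

    pi_candidate = candidate_search(train)   # e.g., maximize a predicted lower bound
    if lower_bound(test, pi_candidate, delta) > j_cur:   # the safety test
        return pi_candidate
    return "No Solution Found"
```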
Creating a safe reinforcement learning algorithm
• Off-policy policy evaluation (OPE)
  • For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $J(\pi_e)$
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the $n$ independent and unbiased estimates of $J(\pi_e)$ into a $1 - \delta$ confidence lower bound on $J(\pi_e)$
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, $a$
WON’T WORK
Off-policy policy evaluation (revisited)
• Importance sampling (IS):
$\mathrm{IS}(D) = \frac{1}{n}\sum_{i=1}^{n}\left(\prod_{t=1}^{L}\frac{\pi_e(a_t^i \mid s_t^i)}{\pi_b(a_t^i \mid s_t^i)}\right)\left(\sum_{t=1}^{L}\gamma^t R_t^i\right)$
• Per-decision importance sampling (PDIS):
$\mathrm{PDIS}(D) = \frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{L}\gamma^t\left(\prod_{\tau=1}^{t}\frac{\pi_e(a_\tau^i \mid s_\tau^i)}{\pi_b(a_\tau^i \mid s_\tau^i)}\right) R_t^i$
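A sketch of PDIS in the same style as the IS sketch earlier: each reward is weighted only by the importance ratios of the actions taken up to that time step. The trajectory format and names are, again, illustrative assumptions.

```python
import numpy as np

def per_decision_importance_sampling(trajectories, pi_e, pi_b, gamma=1.0):
    """Per-decision importance sampling estimate of J(pi_e)."""
    estimates = []
    for traj in trajectories:
        w = 1.0       # running product of ratios up to the current time step
        total = 0.0
        for t, (s, a, r) in enumerate(traj, start=1):
            w *= pi_e(a, s) / pi_b(a, s)
            total += (gamma ** t) * w * r   # R_t weighted by ratios up to time t only
        estimates.append(total)
    return float(np.mean(estimates))
```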
Off-policy policy evaluation (revisited)
• Importance sampling (IS):
$\mathrm{IS}(D) = \frac{1}{n}\sum_{i=1}^{n} w_i \sum_{t=1}^{L}\gamma^t R_t^i$
• Weighted importance sampling (WIS):
$\mathrm{WIS}(D) = \frac{1}{\sum_{i=1}^{n} w_i}\sum_{i=1}^{n} w_i \sum_{t=1}^{L}\gamma^t R_t^i$
Off-policy policy evaluation (revisited)
• Weighted importance sampling (WIS), sketched in code below:
$\mathrm{WIS}(D) = \frac{1}{\sum_{i=1}^{n} w_i}\sum_{i=1}^{n} w_i \sum_{t=1}^{L}\gamma^t R_t^i$
• Not unbiased. When $n = 1$, $\mathbf{E}[\mathrm{WIS}] = J(\pi_b)$
• Strongly consistent estimator of $J(\pi_e)$
  • i.e., $\Pr\left(\lim_{n\to\infty}\mathrm{WIS}(D) = J(\pi_e)\right) = 1$
  • If:
    • Finite horizon
    • One behavior policy, or bounded rewards
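Here is the WIS sketch referenced above, in the same assumed trajectory format as the earlier examples; the normalization by the summed weights (rather than by $n$) is the only change from ordinary IS.

```python
import numpy as np

def weighted_importance_sampling(trajectories, pi_e, pi_b, gamma=1.0):
    """WIS: normalize by the sum of importance weights instead of by n.
    Biased (equals a single behavior-policy return when n = 1) but strongly
    consistent, and typically far lower variance than ordinary IS."""
    weights, returns = [], []
    for traj in trajectories:
        w, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj, start=1):
            w *= pi_e(a, s) / pi_b(a, s)
            ret += (gamma ** t) * r
        weights.append(w)
        returns.append(ret)
    weights = np.asarray(weights)
    return float(np.dot(weights, returns) / weights.sum())
```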
Off-policy policy evaluation (revisited)
• Weighted per-decision importance sampling
  • Also called consistent weighted per-decision importance sampling (CWPDIS)
• A fun exercise!
Control variates
• Given: $X$
• Estimate: $\mu = \mathbf{E}[X]$
• Estimator: $\hat\mu = X$
• Unbiased: $\mathbf{E}[\hat\mu] = \mathbf{E}[X] = \mu$
• Variance: $\mathrm{Var}(\hat\mu) = \mathrm{Var}(X)$
Control variates
• Given: $X$, $Y$, $\mathbf{E}[Y]$
• Estimate: $\mu = \mathbf{E}[X]$
• Estimator: $\hat\mu = X - Y + \mathbf{E}[Y]$
• Unbiased: $\mathbf{E}[\hat\mu] = \mathbf{E}[X - Y] + \mathbf{E}[Y] = \mathbf{E}[X] - \mathbf{E}[Y] + \mathbf{E}[Y] = \mathbf{E}[X] = \mu$
• Variance: $\mathrm{Var}(\hat\mu) = \mathrm{Var}(X - Y + \mathbf{E}[Y]) = \mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y)$
  • Lower variance than $\hat\mu = X$ if $2\,\mathrm{Cov}(X, Y) > \mathrm{Var}(Y)$
• We call $Y$ a control variate (numerical illustration below).
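A tiny numerical illustration of the variance reduction described above; the particular distributions of $X$ and $Y$ are made up purely for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Toy setup: Y is correlated with X and has a known mean (E[Y] = 0),
# so it can serve as a control variate for estimating mu = E[X] = 3.
y = rng.normal(0.0, 1.0, n)
x = 3.0 + y + rng.normal(0.0, 0.5, n)   # Var(X) = 1.25, Cov(X, Y) = Var(Y) = 1

plain = x.mean()                 # mu_hat = X (averaged over samples)
with_cv = (x - y + 0.0).mean()   # mu_hat = X - Y + E[Y]
print(plain, with_cv)            # both near 3; the CV version has lower variance
                                 # because 2 Cov(X, Y) = 2 > Var(Y) = 1
```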
Off-policy policy evaluation (revisited)
• Idea: add a control variate to importance sampling estimators
  • $X$ is the importance sampling estimator
  • $Y$ is a control variate built from an approximate model of the MDP
  • $\mathbf{E}[Y] = 0$ in this case
• $\mathrm{PDIS}^{\mathrm{CV}}(D) = \mathrm{PDIS}(D) - \mathrm{CV}(D)$
• Called the doubly robust estimator (Jiang and Li, 2015)
  • Robust to (1) a poor approximate model, and (2) error in estimates of $\pi_b$
    • If the model is poor, the estimates are still unbiased
    • If the sampling policy is unknown, but the model is good, the MSE will still be low
  • $\mathrm{DR}(D) = \mathrm{PDIS}^{\mathrm{CV}}(D)$
• Non-recursive and weighted forms, as well as the control variate view, are provided by Thomas and Brunskill (2016)
Off-policy policy evaluation (revisited)
$\mathrm{DR}(\pi_e \mid \mathcal{D}) = \frac{1}{n}\sum_{i=1}^{n}\sum_{t=0}^{\infty}\gamma^t\left(w_t^i\left(R_t^i - \hat q^{\pi_e}(S_t^i, A_t^i)\right) + w_{t-1}^i\,\hat v^{\pi_e}(S_t^i)\right)$
• Recall: we want the control variate, $Y$, to cancel with $X$: $R - \hat q(S, A) + \gamma \hat v(S')$
• Per-decision importance sampling (PDIS) weights: $w_t^i = \prod_{\tau=0}^{t}\frac{\pi_e(A_\tau^i \mid S_\tau^i)}{\pi_b(A_\tau^i \mid S_\tau^i)}$, with $w_{-1}^i = 1$
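A rough sketch of the doubly robust / PDIS-with-control-variate estimator following the formula above. The approximate model enters through `q_hat(s, a)` and `v_hat(s)` (where `v_hat` should be the expectation of `q_hat` under $\pi_e$); these, the trajectory format, and the names are assumptions for illustration, not the lecture's code.

```python
import numpy as np

def doubly_robust(trajectories, pi_e, pi_b, q_hat, v_hat, gamma=1.0):
    """DR estimate of J(pi_e).  q_hat and v_hat come from an approximate
    model and act only as a control variate: a poor model affects the
    variance, not the expected value (given pi_b is known).
    For exact cancellation, v_hat(s) should equal sum_a pi_e(a|s) * q_hat(s, a)."""
    estimates = []
    for traj in trajectories:
        w_prev = 1.0   # w_{t-1}, with w_{-1} = 1
        w = 1.0
        total = 0.0
        for t, (s, a, r) in enumerate(traj):   # time indexed from t = 0 here
            w *= pi_e(a, s) / pi_b(a, s)       # w_t
            total += (gamma ** t) * (w * (r - q_hat(s, a)) + w_prev * v_hat(s))
            w_prev = w
        estimates.append(total)
    return float(np.mean(estimates))
```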
Empirical Results (Gridworld)
[Plot: mean squared error (log scale, roughly 0.001 to 10,000) versus number of episodes, $n$ (2 to 2,000), comparing importance sampling (IS) with the approximate model (AM).]
• Approximate model (AM): also called the direct method (Dudík, 2011) and the indirect method (Sutton and Barto, 1998)
Empirical Results (Gridworld)
[Plot: the same gridworld experiment, now also showing per-decision importance sampling (PDIS) alongside IS and AM.]
Empirical Results (Gridworld)
[Plot: the same experiment, adding the doubly robust (DR) estimator.]
Empirical Results (Gridworld)
[Plot: the same experiment, adding WIS and CWPDIS.]
Empirical Results (Gridworld)
[Plot: the same experiment, adding WDR.]
Off-policy policy evaluation (revisited)
• What if $\mathrm{supp}(\pi_e) \subset \mathrm{supp}(\pi_b)$?
  • There is a state-action pair, $(s, a)$, such that $\pi_e(a \mid s) = 0$, but $\pi_b(a \mid s) \ne 0$
  • If we see a history where $(s, a)$ occurs, what weight should we give it?
$\mathrm{IS}(D) = \frac{1}{n}\sum_{i=1}^{n}\left(\prod_{t=1}^{L}\frac{\pi_e(a_t^i \mid s_t^i)}{\pi_b(a_t^i \mid s_t^i)}\right)\left(\sum_{t=1}^{L}\gamma^t R_t^i\right)$
Off-policy policy evaluation (revisited)
• What if there are zero samples ($n = 0$)?
  • The importance sampling estimate is undefined
• What if no samples are in $\mathrm{supp}(\pi_e)$ (or $\mathrm{supp}(p)$ in general)?
  • Importance sampling says: the estimate is zero
  • Alternate approach: undefined
• The importance sampling estimator is unbiased if $n > 0$
• The alternate approach is unbiased given that at least one sample is in the support of $p$
• The alternate approach is detailed in Importance Sampling with Unequal Support (Thomas and Brunskill, AAAI 2017)
Off-policy policy evaluation (revisited)
• Thomas et al. Predictive Off-Policy Policy Evaluation for Nonstationary Decision Problems, with Applications to Digital Marketing (AAAI 2017)
Off-policy policy evaluation (revisited)
• Thomas and Brunskill. Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning (ICML 2016)
[Plot: the gridworld mean squared error results, now including MAGIC alongside IS, PDIS, WIS, CWPDIS, DR, AM, and WDR.]
[Plot: mean squared error versus number of episodes, $n$ (1 to 10,000), comparing IS, DR, AM, WDR, and MAGIC.]
Creating a safe reinforcement learning algorithm
• Off-policy policy evaluation (OPE)
  • For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $J(\pi_e)$
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the $n$ independent and unbiased estimates of $J(\pi_e)$ into a $1 - \delta$ confidence lower bound on $J(\pi_e)$
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, $a$
High-confidence off-policy policy evaluation (revisited)
• Consider using IS + Hoeffding's inequality for HCOPE on mountain car
Natural Temporal Difference Learning, Dabney and Thomas, 2014
High-confidence off-policy policy evaluation (revisited)
• Using 100,000 trajectories
• The evaluation policy's true performance is $0.19 \in [0, 1]$
• We get a 95% confidence lower bound of: -5,831,000
What went wrong?
$w_i = \prod_{t=1}^{L}\frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}$
What went wrong?
$\mathbf{E}[X_i] \ge \frac{1}{n}\sum_{i=1}^{n} X_i - b\sqrt{\frac{\ln(1/\delta)}{2n}}$
• $b \approx 10^{9.4}$
• Largest observed importance-weighted return: 316.
High-confidence off-policy policy evaluation (revisited)
• Removing the upper tail only decreases the expected value.
High-confidence off-policy policy evaluation (revisited)
• Thomas et al., High Confidence Off-Policy Evaluation, AAAI 2015
High-confidence off-policy policy evaluation (revisited)
High-confidence off-policy policy evaluation (revisited)
• Use 20% of the data to optimize $c$.
• Use 80% to compute the lower bound with the optimized $c$.
• Mountain car results, 95% confidence lower bound on the mean:
  • CUT: 0.145
  • Chernoff-Hoeffding: -5,831,000
  • Maurer: -129,703
  • Anderson: 0.055
  • Bubeck et al.: -0.046
High-confidence off-policy policy evaluation (revisited)
• Digital Marketing:
High-confidence off-policy policy evaluation (revisited)
• Cognitive dissonance:
$\mathbf{E}[X_i] \ge \frac{1}{n}\sum_{i=1}^{n} X_i - b\sqrt{\frac{\ln(1/\delta)}{2n}}$
High-confidence off-policy policy evaluation (revisited)
• Student's $t$-test (sketch below)
  • Assumes that $\mathrm{IS}(D)$ is normally distributed
  • By the central limit theorem, it is (as $n \to \infty$)
• Efron's bootstrap methods (e.g., BCa)
• Also, without importance sampling: Hanna, Stone, and Niekum, AAMAS 2017
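For comparison with the exact concentration-inequality bounds above, here is a minimal sketch of the Student's $t$-based lower bound referenced in the list; it relies on approximate normality of the importance-weighted returns, so its guarantee is only asymptotic. Names and inputs are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def t_test_lower_bound(weighted_returns, delta=0.05):
    """Approximate 1 - delta lower bound on the mean via a one-sided
    Student's t interval (semi-safe: exact only as n grows large)."""
    x = np.asarray(weighted_returns, dtype=float)
    n = len(x)
    return x.mean() - stats.t.ppf(1.0 - delta, df=n - 1) * x.std(ddof=1) / np.sqrt(n)
```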
High-confidence off-policy policy evaluation (revisited)
P. S. Thomas. Safe reinforcement learning (PhD Thesis, 2015)
Creating a safe reinforcement learning algorithm
• Off-policy policy evaluation (OPE)
  • For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $J(\pi_e)$
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the $n$ independent and unbiased estimates of $J(\pi_e)$ into a $1 - \delta$ confidence lower bound on $J(\pi_e)$
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, $a$
[Diagram: the historical data is split into a training set (20%), used to select a candidate policy, $\pi$, and a testing set (80%), used by the safety test.]
Safe policy improvement (revisited)
• Is the $1 - \delta$ confidence lower bound on $J(\pi)$ larger than $J(\pi_{\mathrm{cur}})$?
• Thomas et al., ICML 2015
Lecture overview
• What makes a reinforcement learning algorithm safe?
• Notation
• Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
• Empirical results
• Research directions
Empirical Results
• What to look for:
  • Data efficiency
  • Error rates
Empirical Results: Mountain Car
Empirical Results: Digital Marketing
[Figure: agent-environment loop, as before.]
Empirical Results: Digital Marketing
[Bar chart: expected normalized return (values around 0.0027 to 0.0038) for $n$ = 10,000, 30,000, 60,000, and 100,000 episodes, comparing the variants None/CUT, None/BCa, k-Fold/CUT, and k-Fold/BCa.]
Empirical Results: Digital Marketing
Example Results: Diabetes Treatment
[Diagram: blood glucose (sugar) rises when carbohydrates are eaten and falls when insulin is released.]
Example Results: Diabetes Treatment
[Diagram: the same blood glucose loop, now marking hyperglycemia (high blood sugar).]
Example Results: Diabetes Treatment
[Diagram: the same blood glucose loop, marking both hyperglycemia (high blood sugar) and hypoglycemia (low blood sugar).]
Example Results: Diabetes Treatment
$\text{injection} = \frac{\text{blood glucose} - \text{target blood glucose}}{CF} + \frac{\text{meal size}}{CR}$
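The bolus calculation above is simple enough to state directly in code; as I understand the example, the treatment policy is parameterized by the correction factor (CF) and carbohydrate ratio (CR), which are what the safe RL algorithm would adjust. The function below is just a restatement of the formula, with hypothetical names.

```python
def bolus_injection(blood_glucose, target_blood_glucose, meal_size, cf, cr):
    """Insulin bolus: a correction term scaled by the correction factor (CF)
    plus a meal term scaled by the carbohydrate ratio (CR)."""
    return (blood_glucose - target_blood_glucose) / cf + meal_size / cr
```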
Example Results: Diabetes Treatment
• Intelligent Diabetes Management
Example Results: Diabetes Treatment
[Plots: probability that the policy was changed, and probability that the policy was made worse.]
Future Directions
• How to deal with long horizons?
• How to deal with importance sampling being “unfair”?
• What to do when the behavior policy is not known?
• What to do when the behavior policy is deterministic?
Summary
• Safe reinforcement learning:
  • Risk-sensitive
  • Learning from demonstration
  • Asymptotic convergence even if data is off-policy
  • Guaranteed (with probability $1 - \delta$) not to make the policy worse
• Designing a safe reinforcement learning algorithm:
  • Off-policy policy evaluation (OPE)
    • IS, PDIS, WIS, WPDIS, DR, WDR, US, TSP, MAGIC
  • High confidence off-policy policy evaluation (HCOPE)
    • Hoeffding, CUT inequality, Student's $t$-test, BCa
  • Safe policy improvement (SPI)
    • Selecting the candidate policy
Takeaway
• Safe reinforcement learning is tractable!
  • Not just polynomial amounts of data, but practical amounts of data