
Sequential Off-line Learning with Knowledge Gradients

Peter Frazier, Warren Powell, Savas Dayanik

Department of Operations Research and Financial Engineering, Princeton University

2

Overview

• Problem Formulation
• Knowledge Gradient Policy
• Theoretical Results
• Numerical Results

3

Measurement Phase

[Diagram: alternatives 1 through M; N opportunities to do experiments; arrows from alternatives to experimental outcomes, e.g. +1, +.2, −1, −.2]

4

Implementation Phase

[Diagram: alternatives 1 through M; one alternative is selected for implementation and yields a reward]

5

Taxonomy of Sampling Problems

                Reward structure
Sampling        On-line    Off-line
Sequential                    X
Batch

The X marks the problem class studied here: sequential sampling with off-line rewards.

6

Example 1

• One common experimental design is to spread measurements equally across the alternatives.

x^n = (n mod M) + 1

[Plot: prior quality estimate vs. alternative]

7

Example 1

x^n = (n mod M) + 1
Round-Robin Exploration
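A minimal sketch of this round-robin rule (assuming 1-indexed alternatives and time n starting at 0, as on the slide):

```python
def round_robin(n, M):
    """Round-robin exploration: the (1-indexed) alternative measured at time n.

    Cycles through the M alternatives in order, ignoring everything
    learned so far.
    """
    return n % M + 1

# With M = 3 alternatives, the first six measurements sweep them twice.
schedule = [round_robin(n, 3) for n in range(6)]  # [1, 2, 3, 1, 2, 3]
```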

8

• How might we improve round-robin exploration for use with this prior?

Example 2

x^n = argmax_x σ^n_x

9

Example 2

x^n = argmax_x σ^n_x
Largest-Variance Exploration

10

Example 3

Exploitation: x^n = argmax_x μ^n_x
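The index rules of Examples 2 and 3 can be sketched side by side; the function names are illustrative, not from the talk:

```python
def largest_variance(mu, sigma):
    """Example 2: measure the alternative we are least sure about."""
    return max(range(len(sigma)), key=lambda x: sigma[x])

def exploitation(mu, sigma):
    """Example 3: measure the alternative that currently looks best."""
    return max(range(len(mu)), key=lambda x: mu[x])

# 0-indexed beliefs over three alternatives:
mu, sigma = [0.2, -0.1, 0.5], [1.0, 2.0, 0.1]
largest_variance(mu, sigma)  # -> 1 (largest sigma)
exploitation(mu, sigma)      # -> 2 (largest mu)
```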

11

Model

• x^n is the alternative tested at time n.
• Measuring alternative x^n yields y^{n+1} = Y_{x^n} + ε^{n+1}.
• At time n, Y_x ∼ N(μ^n_x, (σ^n_x)²), independent across x.
• The error ε^{n+1} is independent N(0, (σ^ε)²).
• At time N, choose an alternative to maximize E[Y_{x^N}] = E[max_x μ^N_x].

State Transition

• At time n we measure alternative x^n and update our estimate of Y_{x^n} based on the measurement y^{n+1}:

μ^{n+1}_x = [(σ^n_x)^{−2} μ^n_x + (σ^ε)^{−2} y^{n+1}] / [(σ^n_x)^{−2} + (σ^ε)^{−2}]   for x = x^n
(σ^{n+1}_x)^{−2} = (σ^n_x)^{−2} + (σ^ε)^{−2}   for x = x^n

• Estimates of the other Y_x do not change:

μ^{n+1}_x = μ^n_x,  σ^{n+1}_x = σ^n_x   for x ≠ x^n
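This conjugate-normal update can be sketched directly from the slide's formulas (beliefs carried as lists of means and standard deviations; a sketch, not the authors' code):

```python
def update(mu, sigma, x, y, sigma_eps):
    """Bayesian update after measuring alternative x with outcome y.

    Precisions (inverse variances) add, and the posterior mean is the
    precision-weighted average of the prior mean and the observation.
    All other alternatives are left untouched.
    """
    mu, sigma = mu.copy(), sigma.copy()
    prec_prior = sigma[x] ** -2
    prec_meas = sigma_eps ** -2
    mu[x] = (prec_prior * mu[x] + prec_meas * y) / (prec_prior + prec_meas)
    sigma[x] = (prec_prior + prec_meas) ** -0.5
    return mu, sigma

# One measurement of alternative 0 with prior N(0, 1) and noise std 1:
mu2, sigma2 = update([0.0, 0.0], [1.0, 1.0], x=0, y=2.0, sigma_eps=1.0)
# mu2[0] moves halfway to the observation (1.0); sigma2[0] = 1/sqrt(2).
```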

13

• At time n, μ^{n+1}_x is a normal random variable with mean μ^n_x and variance (σ̃^{n+1}_x)² satisfying

(σ^{n+1}_x)² = (σ^n_x)² − (σ̃^{n+1}_x)²

where (σ^n_x)² is the uncertainty about Y_x before the measurement, (σ^{n+1}_x)² is the uncertainty after it, and (σ̃^{n+1}_x)² is the variance of the change in the best estimate of Y_x due to the measurement.

• The value of the optimal policy satisfies Bellman's equation:

V^N(μ^N, σ^N) = max_x μ^N_x
V^n(μ^n, σ^n) = max_{x^n} E^n[V^{n+1}(μ^{n+1}, σ^{n+1})]

14

Utility of Information

Consider our "utility of information"

U(μ^n, σ^n) := max_x μ^n_x

and consider the random change in utility due to a measurement at time n,

ΔU^{n+1} := U(μ^{n+1}, σ^{n+1}) − U(μ^n, σ^n)

15

Knowledge Gradient Definition

• The knowledge gradient policy chooses the measurement that maximizes this expected increase in utility,

x^n = argmax_x E^n[ΔU^{n+1}]

16

Knowledge Gradient

• We may compute the knowledge gradient policy via

E^n[ΔU^{n+1}] := E^n[U(μ^{n+1}, σ^{n+1}) − U(μ^n, σ^n)]
             = E^n[(max_x μ^{n+1}_x) − max_x μ^n_x]
             = E^n[(μ^{n+1}_{x^n} ∨ max_{x≠x^n} μ^n_x) − max_x μ^n_x],

which is the expectation of the maximum of a normal random variable and a constant.

17

Knowledge Gradient

The computation becomes

x^n = argmax_x σ̃(σ^n_x) f(ζ^n_x)

where Φ is the normal cdf, φ is the normal pdf, and

σ̃(s) := s² / √(s² + (σ^ε)²)
ζ^n_x := −|μ^n_x − max_{x'≠x} μ^n_{x'}| / σ̃(σ^n_x)
f(z) := zΦ(z) + φ(z)
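A standard-library sketch of this computation, with Φ built from math.erf; the numbers reproduce Numerical Example 1 later in the deck (indices 0.714, 0.461, 0.282):

```python
import math

def phi(z):   # standard normal pdf
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def Phi(z):   # standard normal cdf
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def kg_index(mu, sigma, x, sigma_eps):
    """Knowledge-gradient index sigma_tilde(sigma_x) * f(zeta_x)."""
    s = sigma[x]
    sigma_tilde = s ** 2 / math.sqrt(s ** 2 + sigma_eps ** 2)
    if sigma_tilde == 0:          # known perfectly: measuring is worthless
        return 0.0
    best_other = max(m for x2, m in enumerate(mu) if x2 != x)
    zeta = -abs(mu[x] - best_other) / sigma_tilde
    return sigma_tilde * (zeta * Phi(zeta) + phi(zeta))

def kg_policy(mu, sigma, sigma_eps):
    return max(range(len(mu)), key=lambda x: kg_index(mu, sigma, x, sigma_eps))

# Numerical Example 1: mu = 0 everywhere, variances 4, 2, 1, sigma_eps = 1.
mu, var = [0.0, 0.0, 0.0], [4.0, 2.0, 1.0]
sigma = [v ** 0.5 for v in var]
kg_policy(mu, sigma, 1.0)  # -> 0 (the slide's alternative 1)
```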

18

Optimality Results

• If our measurement budget allows only one measurement (N=1), the knowledge gradient policy is optimal.

19

Optimality Results

• The knowledge gradient policy is optimal in the limit as the measurement budget N grows to infinity.

• This is really a convergence result.

20

Optimality Results

• The knowledge gradient policy has sub-optimality bounded by

V^n(μ^n, σ^n) − V^{KG,n}(μ^n, σ^n) ≤ (2π)^{−1/2} (N − n − 1) max_x σ̃(σ^n_x),

where V^{KG,n} gives the value of the knowledge gradient policy and V^n the value of the optimal policy.

21

Optimality Results

• If there are exactly 2 alternatives (M=2), the knowledge gradient policy is optimal.

• In this case, the optimal policy reduces to x^n = argmax_x σ^n_x.

22

Optimality Results

• If there is no measurement noise and alternatives may be reordered so that

then the knowledge gradient policy is optimal.

μ^0_1 ≥ μ^0_2 ≥ … ≥ μ^0_M
σ^0_1 ≥ σ^0_2 ≥ … ≥ σ^0_M

23

Numerical Experiments

• 100 randomly generated problems
• M ∼ Uniform{1, …, 100}
• N ∼ Uniform{M, 3M, 10M}
• μ^0_x ∼ Uniform[−1, 1]
• (σ^0_x)² = 1 with probability 0.9, = 10^{−3} with probability 0.1
• σ^ε = 1
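Under the reading that M is uniform on {1, …, 100} and N uniform over the three listed multiples of M, the problem generator can be sketched as follows (a sketch of the setup, not the authors' code):

```python
import random

def random_problem(rng=random):
    """Draw one random test problem as described on the slide."""
    M = rng.randint(1, 100)                 # number of alternatives
    N = rng.choice([M, 3 * M, 10 * M])      # measurement budget
    mu0 = [rng.uniform(-1, 1) for _ in range(M)]
    # Prior variance 1 w.p. 0.9, near-certain (1e-3) w.p. 0.1:
    var0 = [1.0 if rng.random() < 0.9 else 1e-3 for _ in range(M)]
    sigma_eps = 1.0                         # measurement noise std
    return M, N, mu0, var0, sigma_eps
```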

24

Numerical Experiments

25

Interval Estimation

• Compare alternatives via a linear combination of mean and standard deviation:

x^n = argmax_x μ^n_x + z_{α/2} σ^n_x

• The parameter z_{α/2} controls the tradeoff between exploration and exploitation.
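The interval-estimation rule is a one-liner; a small sketch with an illustrative example (values chosen here, not from the talk):

```python
def interval_estimation(mu, sigma, z):
    """IE rule: measure argmax_x mu_x + z * sigma_x.

    z = 0 is pure exploitation; larger z rewards uncertainty.
    """
    return max(range(len(mu)), key=lambda x: mu[x] + z * sigma[x])

mu, sigma = [1.0, 0.0], [0.1, 1.0]
interval_estimation(mu, sigma, z=0.0)  # -> 0: best mean wins
interval_estimation(mu, sigma, z=2.0)  # -> 1: 0 + 2*1 beats 1 + 2*0.1
```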

26

KG / IE Comparison

27

KG / IE Comparison

[Plot: Value of KG − Value of IE across test problems]

28

IE and “Sticking”

μ = [2.5, 0, …, 0],  σ = [0, 1, …, 1]

Alternative 1 is known perfectly.

x^n = argmax_x μ^n_x + z_{α/2} σ^n_x,  z_{α/2} = 2.5

29

IE and “Sticking”

V^IE = 2.5;  V = E[max(2.5, Z_1, …, Z_M)] ↗ ∞ as M ↗ ∞
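A Monte Carlo sketch of the right-hand side, taking the Z_i as independent standard normals (matching the σ = [0, 1, …, 1] prior): the value IE sticks with stays at 2.5, while the value of the best alternative keeps growing with M.

```python
import random
import statistics

def value_of_best(M, trials=20000, rng=random.Random(0)):
    """Estimate E[max(2.5, Z_1, ..., Z_M)] with Z_i ~ N(0, 1)."""
    return statistics.fmean(
        max(2.5, *(rng.gauss(0, 1) for _ in range(M)))
        for _ in range(trials)
    )

# value_of_best(10) stays near 2.5; value_of_best(5000, trials=500)
# is clearly larger, illustrating the divergence as M grows.
```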

30

Thank You
Any Questions?

31

Numerical Example 1

x^n = argmax_x σ̃^n_x f(ζ^n_x) = 1

x | μ_x | σ²_x | σ̃_x = σ²_x/√(σ²_x + (σ^ε)²) | Δ_x = |μ_x − max_{x'≠x} μ_{x'}| | ζ_x = −Δ_x/σ̃_x | σ̃_x f(ζ_x)
1 |  0  |  4   | 1.789                        | 0                              | 0               | 0.714
2 |  0  |  2   | 1.155                        | 0                              | 0               | 0.461
3 |  0  |  1   | 0.707                        | 0                              | 0               | 0.282

32

Numerical Example 2

x^n = argmax_x σ̃^n_x f(ζ^n_x) = 2

x | μ_x | σ²_x | σ̃_x = σ²_x/√(σ²_x + (σ^ε)²) | Δ_x = |μ_x − max_{x'≠x} μ_{x'}| | ζ_x = −Δ_x/σ̃_x | σ̃_x f(ζ_x)
1 |  1  |  0   | 0                            | 1                              | −∞              | 0
2 |  0  |  1   | 0.707                        | 1                              | −1.414          | 0.025
3 | −1  |  1   | 0.707                        | 2                              | −2.828          | 0.001

33

Numerical Example 3

x^n = argmax_x σ̃^n_x f(ζ^n_x) = 2

x | μ_x | σ²_x | σ̃_x = σ²_x/√(σ²_x + (σ^ε)²) | Δ_x = |μ_x − max_{x'≠x} μ_{x'}| | ζ_x = −Δ_x/σ̃_x | σ̃_x f(ζ_x)
1 |  1  |  1   | 0.707                        | 1                              | −1.41           | 0.0251
2 |  0  |  4   | 1.789                        | 1                              | −0.56           | 0.3223
3 | −1  |  1   | 0.707                        | 2                              | −2.83           | 0.0005

34

Knowledge Gradient Example

35

Interval Estimation Example

x^n = argmax_x μ^n_x + 3σ^n_x

36

Boltzmann Exploration

Parameterized by a declining sequence of temperatures (T^0, …, T^{N−1}):

P^n{x^n = x} = exp(μ^n_x / T^n) / Σ_{x'=1}^{M} exp(μ^n_{x'} / T^n)
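This softmax sampling rule can be sketched as follows (max-subtraction added for numerical stability; a sketch, not from the talk):

```python
import math
import random

def boltzmann_probs(mu, T):
    """Softmax probabilities P{x^n = x} = exp(mu_x / T) / sum_x' exp(mu_x' / T)."""
    m = max(mu)  # subtract the max before exponentiating, for stability
    w = [math.exp((v - m) / T) for v in mu]
    total = sum(w)
    return [wi / total for wi in w]

def boltzmann_sample(mu, T, rng=random):
    """Draw the alternative to measure; low T exploits, high T explores."""
    return rng.choices(range(len(mu)), weights=boltzmann_probs(mu, T))[0]

boltzmann_probs([1.0, 0.0], T=1.0)  # ~[0.731, 0.269]
```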