Page 1

(c) 2013 W.B. Powell

Outline

The knowledge-gradient policy
The S-curve effect
Online vs. offline problems
Learning a continuous function
KG with parametric beliefs for drug discovery

Page 2

(c) 2013 W.B. Powell 2

The knowledge gradient

Basic principle:
» Assume you can make only one measurement, after which you have to make a final choice (the implementation decision).
» What choice would you make now to maximize the expected value of the implementation decision?

[Figure: five alternatives; a measurement changes the estimate of the value of option 5, and a large enough change produces a change in the decision.]

Page 3

(c) 2013 W.B. Powell 3

The knowledge gradient

General model
» Off-line learning – We have a measurement budget of N observations. After we do our measurements, we have to make an implementation decision.
» Notation:

$y$ = Implementation decision
$K$ = Our state of knowledge (e.g. mean and variance of our estimates of costs and other parameters)
$F(y,K)$ = Value of making decision $y$ given knowledge $K$
$x$ = Measurement decision
$K(x)$ = Updated distribution of belief about costs after measuring $x$

Page 4

(c) 2013 W.B. Powell 4

The knowledge gradient

The knowledge gradient
» The knowledge gradient is the expected value of a single measurement $x$, given by

$\nu^{KG}_x = \mathbb{E}\left[\max_y F(y, K(x))\right] - \max_y F(y, K)$

» The challenge is a computational one: how do we compute the expectation?

Knowledge gradient policy: $X^{KG} = \arg\max_x \nu^{KG}_x$
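One way to see what this expectation means is to estimate it by brute force. The sketch below is my own illustration (not the course software): it assumes independent normal beliefs with means mu and variances var, Gaussian measurement noise with variance var_w, simulates the posterior mean that a measurement of alternative x would produce, and averages the resulting improvement. With the priors from the spreadsheet example a few slides below (and $\sigma_W = 1$), it approximately reproduces those KG values; the next slides give the closed form.

```python
import numpy as np

def kg_by_simulation(mu, var, var_w, x, n_samples=200_000, seed=0):
    """Monte Carlo estimate of nu^KG_x = E[max_y F(y, K(x))] - max_y F(y, K)
    for independent normal beliefs (mu, var) and measurement-noise variance var_w."""
    rng = np.random.default_rng(seed)
    # Predictive distribution of the observation W^{n+1}_x before we see it.
    w = rng.normal(mu[x], np.sqrt(var[x] + var_w), size=n_samples)
    # Posterior mean of x after each simulated observation (precision-weighted average).
    post_mean_x = (mu[x] / var[x] + w / var_w) / (1.0 / var[x] + 1.0 / var_w)
    # Value of the implementation decision after the measurement: the best of the
    # unchanged other means and the updated mean of x.
    best_other = np.max(np.delete(mu, x))
    value_after = np.maximum(best_other, post_mean_x)
    return value_after.mean() - np.max(mu)

mu = np.array([3.0, 4.0, 5.0, 5.0, 5.0])         # prior means (hypothetical)
var = np.array([64.0, 64.0, 64.0, 81.0, 100.0])  # prior variances (hypothetical)
for x in range(len(mu)):
    print(x + 1, round(kg_by_simulation(mu, var, var_w=1.0, x=x), 2))
```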

Page 5

(c) 2013 W.B. Powell 5

The knowledge gradient

Computing the knowledge gradient
» Notation:

$\beta^n_x$ = Precision (inverse variance) of our estimate of the value of $x$
$\beta^W$ = Precision of the measurement noise ($= 1/\sigma^2_W$)
$W^{n+1}_x$ = Measurement of $x$ in iteration $n+1$ (unknown at $n$)

» We update the precision using

$\beta^{n+1}_x = \beta^n_x + \beta^W$

» In terms of the variance, this is the same as

$\sigma^{2,n+1}_x = \left[\left(\sigma^{2,n}_x\right)^{-1} + \left(\sigma^2_W\right)^{-1}\right]^{-1} = \frac{\sigma^{2,n}_x}{1 + \sigma^{2,n}_x/\sigma^2_W}$

Page 6

(c) 2013 W.B. Powell 6

The knowledge gradient

Computing the knowledge gradient
» The change in variance can be found to be

$\tilde\sigma^{2,n}_x = \mathrm{Var}\!\left[\bar\mu^{n+1}_x \mid S^n\right] = \sigma^{2,n}_x - \sigma^{2,n+1}_x = \frac{\sigma^{2,n}_x}{1 + \sigma^2_W/\sigma^{2,n}_x}$

» Next compute the normalized influence:

$\zeta^n_x = -\left|\frac{\bar\mu^n_x - \max_{x'\neq x}\bar\mu^n_{x'}}{\tilde\sigma^n_x}\right|$

» Let

$f(\zeta) = \zeta\,\Phi(\zeta) + \phi(\zeta)$, where $\Phi(\zeta)$ is the cumulative standard normal distribution and $\phi(\zeta)$ is the standard normal density.

» The knowledge gradient is computed using

$\nu^{KG,n}_x = \tilde\sigma^n_x\, f(\zeta^n_x)$
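The formulas above translate directly into code. Below is a minimal pure-Python sketch (the function and variable names are mine, not from the course materials); it assumes independent normal beliefs and a known noise standard deviation $\sigma_W$.

```python
from math import erf, exp, pi, sqrt

def norm_pdf(z):
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def norm_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def knowledge_gradient(mu, sigma, sigma_w):
    """Return nu^KG_x for each alternative, given posterior means mu,
    posterior standard deviations sigma, and noise standard deviation sigma_w."""
    nu = []
    for x in range(len(mu)):
        # Change in variance from one measurement of x: sigma_tilde^2 = sigma^2,n - sigma^2,n+1
        var_next = 1.0 / (1.0 / sigma[x] ** 2 + 1.0 / sigma_w ** 2)
        sigma_tilde = sqrt(sigma[x] ** 2 - var_next)
        # Normalized distance to the best of the other alternatives
        best_other = max(mu[i] for i in range(len(mu)) if i != x)
        zeta = -abs(mu[x] - best_other) / sigma_tilde
        f = zeta * norm_cdf(zeta) + norm_pdf(zeta)
        nu.append(sigma_tilde * f)
    return nu
```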

Page 7

(c) 2013 W.B. Powell 7

The knowledge gradient

Computing the knowledge gradient

[Figure: five alternatives. The normalized distance from each estimate $\bar\mu^n_x$ to the best (or, for the current best, the second best) alternative is

$\zeta^n_x = -\left|\frac{\bar\mu^n_x - \max_{x'\neq x}\bar\mu^n_{x'}}{\tilde\sigma^n_x}\right|$

and the quantity illustrated for alternative 5 is $\nu^{KG}_5$.]

Page 8

(c) 2013 W.B. Powell

The knowledge gradient

KG calculations illustrated
» You may click on the spreadsheet to see the calculations.

Decision  mu^n  beta^n   beta^{n+1}  sigma~   max_x'  zeta     f(zeta)  nu^KG_x
1         3.0   0.0156   1.0156      7.9382   5       -0.2519  0.2856   2.2669
2         4.0   0.0156   1.0156      7.9382   5       -0.1260  0.3391   2.6920
3         5.0   0.0156   1.0156      7.9382   5        0.0000  0.3989   3.1669
4         5.0   0.0123   1.0123      8.9450   5        0.0000  0.3989   3.5685
5         5.0   0.0100   1.0100      9.9504   5        0.0000  0.3989   3.9696
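For readers without the spreadsheet, this short script (my own sketch) reproduces the rows above. It assumes $\beta^W = 1$, i.e. $\sigma_W = 1$, which is consistent with $\beta^{n+1} = \beta^n + 1$ in the table, and prior standard deviations of 8, 8, 8, 9 and 10 (so $\beta^n$ = 0.0156, 0.0156, 0.0156, 0.0123, 0.0100).

```python
from math import erf, exp, pi, sqrt

mu     = [3.0, 4.0, 5.0, 5.0, 5.0]
beta   = [1 / 64, 1 / 64, 1 / 64, 1 / 81, 1 / 100]   # prior precisions beta^n
beta_w = 1.0                                          # assumed measurement precision

for x in range(len(mu)):
    var_n, var_next = 1.0 / beta[x], 1.0 / (beta[x] + beta_w)
    sigma_tilde = sqrt(var_n - var_next)
    best_other = max(mu[i] for i in range(len(mu)) if i != x)
    zeta = -abs(mu[x] - best_other) / sigma_tilde
    f = zeta * 0.5 * (1 + erf(zeta / sqrt(2))) + exp(-0.5 * zeta ** 2) / sqrt(2 * pi)
    print(x + 1, round(sigma_tilde, 4), round(zeta, 4), round(f, 4), round(sigma_tilde * f, 4))
```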

Page 9

(c) 2013 W.B. Powell 9

The knowledge gradient

[Bar chart: $\bar\mu$, $\sigma$, and the KG index for choices 1–5.]

Page 10

(c) 2013 W.B. Powell 10

The knowledge gradient

[Bar chart: $\bar\mu$, $\sigma$, and the KG index for choices 1–5.]

Page 11

(c) 2013 W.B. Powell 11

The knowledge gradient

[Bar chart: $\bar\mu$, $\sigma$, and the KG index for choices 1–5.]

Page 12

(c) 2013 W.B. Powell 12

The knowledge gradient

The knowledge gradient policy

Properties
» Effectively a myopic policy, but also similar to steepest ascent for nonlinear programming.
» The best single measurement you can make (by construction).
» Asymptotically optimal (more difficult proof). As the measurement budget grows, we get the optimal solution.
» The knowledge gradient policy is the only stationary policy with this behavior.
• Many policies are asymptotically optimal (e.g. pure exploration, hybrid exploration/exploitation, epsilon-greedy), but are not myopically optimal.

$X^{KG}(S^n) = \arg\max_x \nu^{KG,n}_x$

Page 13

(c) 2013 W.B. Powell

The knowledge gradient policy

Myopic and asymptotic optimality

[Figure: performance vs. number of measurements, relative to the optimal solution. Curves labeled “Asymptotically optimal,” “Fast initial convergence, but stalls,” and “Ideal.”]

Page 14

(c) 2013 W.B. Powell

The knowledge gradient policy

Myopic and asymptotic optimality

[Figure: the knowledge gradient combines myopic optimality (fast initial convergence) with asymptotic optimality, tracking the ideal curve toward the optimal solution.]

Page 15

The knowledge gradient policy

KG versus Gittins indices for multiarmed bandit problems
» Gittins indices are provably optimal, but computing them is hard.
» Computed using the Chick and Gans (2009) approximation.

[Histograms: improvement of KG over Gittins, with an informative prior and with an uninformative prior.]

(c) 2013 W.B. Powell

Page 16

Knowledge gradient for online learning

But knowledge gradient can also handle:
» Finite horizons
» Correlated beliefs

[Four panels comparing opportunity cost: KG vs. Gittins, KG vs. interval estimation, KG vs. upper confidence bounding, KG vs. pure exploitation.]

(c) 2013 W.B. Powell

Page 17

Knowledge gradient for online learning

KG versus interval estimation
» Recall that with IE, you choose the alternative with the highest

$\nu^{IE}_x = \bar\mu_x + z_\alpha\,\bar\sigma_x$

[Plot: opportunity cost vs. the IE parameter $z_\alpha$ for IE and KG, showing the region where IE beats KG.]

(c) 2013 W.B. Powell

Page 18

(c) 2013 W.B. Powell

Outline

The knowledge-gradient policy
The S-curve effect
Online vs. offline problems
Learning a continuous function
KG with parametric beliefs for drug discovery

Page 19

(c) 2013 W.B. Powell

The nonconcavity of information

We can calculate the value of $n_x$ measurements.
» Updating the precision would now be

$\beta^{n}_x = \beta^0_x + n_x\,\beta^W$

» Compute $\tilde\sigma^{2,0}_x(n_x)$ using

$\tilde\sigma^{2,0}_x(n_x) = \mathrm{Var}\!\left[\bar\mu^{n}_x \mid S^0\right] = \sigma^{2,0}_x - \sigma^{2,n}_x = \frac{\sigma^{2,0}_x}{1 + \sigma^2_W/(n_x\,\sigma^{2,0}_x)}$

» Calculation of the knowledge gradient is the same:

$\nu^{KG}_x(n_x) = \tilde\sigma^{0}_x(n_x)\, f(\zeta^0_x)$ = Value of $n_x$ measurements, where $\zeta^0_x = -\left|\dfrac{\bar\mu^0_x - \max_{x'\neq x}\bar\mu^0_{x'}}{\tilde\sigma^0_x(n_x)}\right|$

Page 20

(c) 2013 W.B. Powell

The nonconcavity of information

The value of information is often concave…

Page 21

(c) 2013 W.B. Powell

The nonconcavity of information

… but not always.
» The marginal value of a single measurement can be small!

Page 22

(c) 2013 W.B. Powell

The nonconcavity of information

What influences the shape?
» Consider a baseball hitter whose “true” batting average is 0.300.
• The variance of a single at-bat is .3 x .7 = .21, so the standard deviation is about .46.
• What if the difference in my belief in the batting averages of two players is .02 (e.g. I think one bats .300 and the other bats .280)?
• Assume the standard deviation of my belief about the batting average is 50 points (.05).
» Click here to bring up the spreadsheet.
» Notes (a sketch reproducing these numbers follows below):
• The expected value of 10 at-bats in terms of increasing the expected batting average is .00087.
• The expected value of 100 at-bats in terms of increasing the expected batting average is .0068.
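A sketch of the calculation behind these notes, using the $\nu^{KG}_x(n_x)$ formula from the previous slide with the inputs stated above ($\sigma^{2,0} = 0.05^2$, $\sigma^2_W = 0.3 \times 0.7$, and a 0.02 gap to the competitor); the spreadsheet itself is not reproduced here.

```python
from math import erf, exp, pi, sqrt

var0  = 0.05 ** 2     # prior variance of my belief about the batting average
var_w = 0.3 * 0.7     # variance of a single at-bat (Bernoulli with p = 0.3)
delta = 0.02          # gap between my beliefs about the two players

def value_of_n_at_bats(n):
    var_n = 1.0 / (1.0 / var0 + n / var_w)   # posterior variance after n at-bats
    sigma_tilde = sqrt(var0 - var_n)         # sigma_tilde(n)
    zeta = -delta / sigma_tilde
    f = zeta * 0.5 * (1 + erf(zeta / sqrt(2))) + exp(-0.5 * zeta ** 2) / sqrt(2 * pi)
    return sigma_tilde * f

print(round(value_of_n_at_bats(10), 5))   # about .00087
print(round(value_of_n_at_bats(100), 4))  # about .0068
```

The average value per at-bat, value_of_n_at_bats(n)/n, first rises and then falls as n grows, which is exactly the S-curve (nonconcave value of information) discussed on the following slides.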

Page 23

(c) 2013 W.B. Powell

The nonconcavity of information

Optimal number of choices

As measurement noise increases, the optimal number of alternatives to evaluate decreases.

[Figure: value of information vs. number of alternatives being evaluated, for increasing levels of measurement noise.]

Page 24

(c) 2013 W.B. Powell

The nonconcavity of information

Examples of problems with non-concave information
» Finding the best hitters for a baseball team
» Finding the best stock pickers for an investment fund
» Finding the best high-value, low-volume products to put in inventory

Implications?
» Compare to the behavior we learned at the beginning of the course when we had an “S-curve” problem.

Page 25

(c) 2013 W.B. Powell

The nonconcavity of information

The KG(*) policy
» Maximize the average value per measurement, $\max_{n_x} \nu^{KG}_x(n_x)/n_x$.

Page 26

(c) 2013 W.B. Powell

Outline

The knowledge-gradient policy
The S-curve effect
Online vs. offline problems
Learning a continuous function
KG with parametric beliefs for drug discovery

Page 27

(c) 2013 W.B. Powell 27

Online vs. offline learning problems

Types of learning problems
» On-line learning
• Learn as you earn
• Examples:
– Finding the best path to work
– What is the best set of energy-saving technologies to use for your building?
– What is the best medication to control your diabetes?
• As you collect information, you collect rewards (or lose money). Collecting information is coincident with using the information.
• You have to balance the value of what you earn with a choice now against the benefit of the information you will gain for future decisions.

Page 28

(c) 2013 W.B. Powell 28

Online vs. offline learning problems

Types of learning problems
» Off-line learning
• There is a phase of information collection with a finite (sometimes small) budget.
• You are allowed to make a series of measurements, after which you make an implementation decision.
• Examples:
– Finding the best drug compound through laboratory experiments
– Finding the best design of a manufacturing configuration or engineering design which is evaluated using an expensive simulation
– What is the best combination of designs for hydrogen production, storage and conversion?
• Off-line learning separates the process (and costs) of learning from the benefits of using the information that you have gained.

Page 29

(c) 2013 W.B. Powell 29

Online vs. offline learning problems

For problems with a finite number of alternatives
» On-line learning (learn as you earn)
• This is known in the literature as the multi-armed bandit problem, where you are trying to find the slot machine with the highest payoff.
» Off-line learning
• You have a budget for taking measurements. After your budget is exhausted, you have to make a final choice.
• This is known as the ranking and selection problem.

Page 30

Online vs. offline learning problems

Knowledge gradient policy
» For off-line problems:

$\nu^{KG,n}_x$ = Value of a measurement for a single decision

» For finite-horizon on-line problems:
• Assume we have made 3 measurements out of our budget of 20.
• What is the value of learning from one more measurement?
• $\nu^{KG,3}_x$ is the improvement in the 4th decision given what we know after the 3rd measurement. But we benefit from this decision 17 more times:

$\nu^{KG\text{-}OL,3}_x = \bar\mu^3_x + (20-3)\,\nu^{KG,3}_x = \bar\mu^3_x + 17\,\nu^{KG,3}_x$

• The more times we can use the information, the more we are willing to take a loss now for future benefits.

(c) 2013 W.B. Powell

Page 31

Online vs. offline learning problems

Knowledge gradient policy
» For finite-horizon on-line problems:

$\nu^{KG\text{-}OL,n}_x = \bar\mu^n_x + (N - n)\,\nu^{KG,n}_x$

» For infinite-horizon discounted problems:

$\nu^{KG\text{-}OL,n}_x = \bar\mu^n_x + \frac{\gamma}{1-\gamma}\,\nu^{KG,n}_x$

In both cases the second term is the value of information.

Compare to Gittins indices for bandit problems

$\nu^{Gittins,n}_x = \bar\mu^n_x + \Gamma\!\left(\frac{\sigma^n_x}{\sigma_W},\gamma\right)\sigma_W$

… and UCB

$\nu^{UCB,n}_x = \bar\mu^n_x + 4\sigma_W\sqrt{\frac{\log n}{N^n_x}}$

where $N^n_x$ is the number of times alternative $x$ has been measured. The bonus terms in the Gittins and UCB indices are the ones marked “???” on the slide: they do not have the same direct value-of-information interpretation.
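A small sketch of how these scores can be computed side by side. The KG values are taken from the earlier spreadsheet example, while the measurement counts, horizon and discount factor are toy values of my own; the Gittins approximation is omitted.

```python
import math

mu     = [3.0, 4.0, 5.0, 5.0, 5.0]                 # current estimates mu^n_x
nu_kg  = [2.2669, 2.6920, 3.1669, 3.5685, 3.9696]  # offline KG values (spreadsheet slide)
counts = [1, 1, 1, 1, 1]                           # N^n_x: times each alternative was measured
sigma_w, N, n, gamma = 1.0, 20, 3, 0.9

kg_finite   = [m + (N - n) * v for m, v in zip(mu, nu_kg)]              # finite-horizon online KG
kg_discount = [m + gamma / (1 - gamma) * v for m, v in zip(mu, nu_kg)]  # infinite-horizon discounted
ucb         = [m + 4 * sigma_w * math.sqrt(math.log(n) / c) for m, c in zip(mu, counts)]

for name, scores in [("KG finite", kg_finite), ("KG discounted", kg_discount), ("UCB", ucb)]:
    print(name, [round(s, 1) for s in scores])
```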

(c) 2013 W.B. Powell

Page 32

(c) 2013 W.B. Powell

Outline

The knowledge-gradient policy
The S-curve effect
Online vs. offline problems
Learning a continuous function
KG with parametric beliefs for drug discovery

Page 33

(c) 2013 W.B. Powell

Learning the maximum of a function

Choosing prices to maximize revenue

» Measuring a price of $80 tells us something about the response at $81.

» Initial solution
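This example depends on correlated beliefs: measuring revenue at one price updates the belief at nearby prices. A minimal sketch of that update for a multivariate normal belief follows; the covariance kernel and all numbers are illustrative assumptions, not values from the lecture.

```python
import numpy as np

prices = np.arange(70, 91)                    # candidate prices $70..$90
mu     = np.full(len(prices), 500.0)          # prior mean revenue at each price
# Assumed squared-exponential covariance: nearby prices have correlated revenues.
Sigma  = 50.0 ** 2 * np.exp(-((prices[:, None] - prices[None, :]) / 5.0) ** 2)
var_w  = 30.0 ** 2                            # measurement-noise variance

def measure(mu, Sigma, x, w, var_w):
    """Update the multivariate normal belief after observing revenue w at price index x."""
    gain = Sigma[:, x] / (Sigma[x, x] + var_w)   # how much each price's belief moves
    mu_new = mu + gain * (w - mu[x])
    Sigma_new = Sigma - np.outer(gain, Sigma[x, :])
    return mu_new, Sigma_new

x = 10                                         # index of the $80 price
mu1, Sigma1 = measure(mu, Sigma, x, w=620.0, var_w=var_w)
print(round(mu1[x], 1), round(mu1[x + 1], 1))  # the $81 estimate moves almost as much as $80
```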

Page 34

(c) 2013 W.B. Powell

Learning the maximum of a function

Choosing prices to maximize revenue

» Measuring a price of $80 tells us something about the response at $81.

» After three measurements (including endpoints)

Page 35

(c) 2013 W.B. Powell

Learning the maximum of a function

Choosing prices to maximize revenue

» Measuring a price of $80 tells us something about the response at $81.

» After four measurements

Page 36

(c) 2013 W.B. Powell

Learning the maximum of a function

Choosing prices to maximize revenue

» Measuring a price of $80 tells us something about the response at $81.

» After 10 measurements

Page 37

Parametric beliefs and drug discovery

Learning a concave function

» As the number of observations increases, the policy quickly evolves to pure exploitation.

(c) 2013 W.B. Powell

Page 38

Parametric beliefs and drug discovery

Insights
» Trying what appears to be best maximizes profits given what you know, but you may be wrong.

» Generally not a good idea to try ideas that genuinely look bad.

» Best to try ideas that are just off center. You learn more, and you may learn that profits are even higher with different strategies.

(c) 2013 W.B. Powell

Page 39

(c) 2013 W.B. Powell

OJ game 2009

Mom&Pop pricing (2009)

Page 40

(c) 2013 W.B. Powell

OJ game 2009

Performance of different teams

Page 41

(c) 2013 W.B. Powell

OJ game 2010

Challenge
» If you are underperforming:
• Are your prices right?

• Perhaps they are too high? Too low?

• What is your level of uncertainty?

Page 42

(c) 2013 W.B. Powell

Outline

The knowledge-gradient policy
The S-curve effect
Online vs. offline problems
Learning a continuous function
KG with parametric beliefs for drug discovery

Page 43

(c) 2013 W.B. Powell

Parametric beliefs and drug discovery

Biomedical research
» How do we find the best drug to cure cancer?

» There are millions of combinations, with laboratory budgets that cannot test everything.

» We need a method for sequencing experiments.

Page 44

(c) 2013 W.B. Powell

Parametric beliefs and drug discovery

Designing molecules

» X and Y are sites where we can hang substituents to change the behavior of the molecule

Page 45

(c) 2013 W.B. Powell

Parametric beliefs and drug discovery

We express our belief using a linear, additive QSAR model:

$Y^m = \theta_0 + \sum_{\text{sites } i}\ \sum_{\text{substituents } j} \theta_{ij}\, X^m_{ij}$

$X^m_{ij} = 1$ if substituent $j$ is at site $i$ of molecule $m$, 0 otherwise.
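A minimal sketch of how this belief model induces correlated beliefs across compounds (the sites, substituents and prior values are made-up placeholders, not the actual molecule): each compound is encoded by indicator features, and a normal prior on $\theta$ implies $\mathrm{Cov}(Y^m, Y^{m'}) = x_m^\top \Sigma_\theta x_{m'}$.

```python
import numpy as np
from itertools import product

sites = ["X", "Y"]                      # substituent sites (placeholders)
substituents = ["H", "F", "OH"]         # candidate substituents (placeholders)
features = [(i, j) for i in sites for j in substituents]   # one theta_ij per (site, substituent)

def encode(compound):
    """Indicator vector [1, X_ij, ...] for a compound given as {site: substituent}."""
    return np.array([1.0] + [1.0 if compound[i] == j else 0.0 for (i, j) in features])

compounds = [dict(zip(sites, combo)) for combo in product(substituents, repeat=len(sites))]
X = np.vstack([encode(c) for c in compounds])     # 9 compounds x 7 features

theta_mean = np.zeros(X.shape[1])                 # prior mean on theta (assumed)
theta_cov  = np.eye(X.shape[1])                   # prior covariance on theta (assumed)

mu_Y  = X @ theta_mean                            # prior mean value of each compound
cov_Y = X @ theta_cov @ X.T                       # induced covariance between compounds
print(cov_Y[0, 1])   # compounds sharing a substituent at a site have correlated beliefs
```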

Page 46

(c) 2013 W.B. Powell

If we sample points near the middle, we will have a difficult time estimating the function:

Parametric beliefs and drug discovery

Page 47

(c) 2013 W.B. Powell

Sampling near the endpoints produces more stable estimates. Now take this into higher dimensions.
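The point can be checked with the classical formula $\mathrm{Var}(\hat\theta_1) = \sigma^2 / \sum_i (x_i - \bar x)^2$ for the slope of a one-dimensional linear fit: pushing measurements toward the endpoints increases $\sum_i (x_i - \bar x)^2$ and so shrinks the variance of the estimate. A quick numerical check, with illustrative numbers only:

```python
import numpy as np

sigma2 = 1.0                                               # measurement-noise variance
designs = {
    "middle":    np.array([0.4, 0.45, 0.5, 0.55, 0.6]),    # samples bunched near the middle
    "endpoints": np.array([0.0, 0.0, 0.5, 1.0, 1.0]),      # samples pushed to the endpoints
}
for name, x in designs.items():
    var_slope = sigma2 / np.sum((x - x.mean()) ** 2)       # variance of the estimated slope
    print(name, round(var_slope, 1))                       # middle: 40.0, endpoints: 1.0
```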

Parametric beliefs and drug discovery

Page 48

(c) 2013 W.B. Powell

Parametric beliefs and drug discovery

Knowledge gradient versus pure exploration for 99 compounds

[Plot: performance under best possible vs. number of molecules tested (out of 99), comparing the knowledge gradient with pure exploration.]

Page 49

(c) 2013 W.B. Powell

Parametric beliefs and drug discovery

A more complex molecule:

» From this base molecule, we created problems with 10,000 compounds, and one with 87,120 compounds.

[Figure: base molecule with numbered substituent sites R1–R5 and a list of potential substituents.]

Page 50

(c) 2013 W.B. Powell

Parametric beliefs and drug discovery

Compact representation of the 10,000-compound problem
» Results from 15 sample paths

[Plot: performance under best possible vs. number of molecules tested.]

Page 51

(c) 2013 W.B. Powell

Parametric beliefs and drug discovery

Single sample path on the molecule with 87,120 combinations

[Plot: performance under best possible vs. number of molecules tested.]

Page 52

(c) 2013 W.B. Powell

Parametric beliefs and drug discovery

Representing beliefs using linear regression has many applications:
» How do we find the optimal price of a product sold on the internet?
» Which internet ad will generate the most ad clicks?
» How will a customer, described by a set of attributes, respond to a price for a contract?
» What parameter settings produce the best results from my business simulator?
» What are the best features that I should include in a laptop?