Page 1

(c) 2013 W.B. Powell

Outline

The knowledge-gradient policy
The S-curve effect
Online vs. offline problems
Learning a continuous function
KG with parametric beliefs for drug discovery

Page 2

(c) 2013 W.B. Powell 2

The knowledge gradient

Basic principle:
» Assume you can make only one measurement, after which you have to make a final choice (the implementation decision).
» What choice would you make now to maximize the expected value of the implementation decision?

[Figure: five alternatives; a measurement changes the estimate of the value of option 5, and a large enough change produces a change in the decision.]

Page 3

(c) 2013 W.B. Powell 3

The knowledge gradient

General model
» Off-line learning – We have a measurement budget of N observations. After we do our measurements, we have to make an implementation decision.
» Notation:

$y$ = Implementation decision
$K$ = Our state of knowledge (e.g. mean and variance of our estimates of costs and other parameters)
$F(y,K)$ = Value of making decision $y$ given knowledge $K$
$x$ = Measurement decision
$K(x)$ = Updated distribution of belief about costs after measuring $x$

Page 4

(c) 2013 W.B. Powell 4

The knowledge gradient

The knowledge gradient
» The knowledge gradient is the expected value of a single measurement $x$, given by

$\nu^{KG}_x = \mathbb{E}\left[\max_y F(y, K(x))\right] - \max_y F(y, K)$

» The challenge is a computational one: how do we compute the expectation?

Knowledge gradient policy: $X^{KG} = \arg\max_x \nu^{KG}_x$
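One way to see what this expectation means is to estimate it by brute force. The sketch below is my own illustration (not the course software): it assumes independent normal beliefs with means mu and variances var, Gaussian measurement noise with variance var_w, simulates the posterior mean that a measurement of alternative x would produce, and averages the resulting improvement. With the priors from the spreadsheet example a few slides below (and $\sigma_W = 1$), it approximately reproduces those KG values; the next slides give the closed form.

```python
import numpy as np

def kg_by_simulation(mu, var, var_w, x, n_samples=200_000, seed=0):
    """Monte Carlo estimate of nu^KG_x = E[max_y F(y, K(x))] - max_y F(y, K)
    for independent normal beliefs (mu, var) and measurement-noise variance var_w."""
    rng = np.random.default_rng(seed)
    # Predictive distribution of the observation W^{n+1}_x before we see it.
    w = rng.normal(mu[x], np.sqrt(var[x] + var_w), size=n_samples)
    # Posterior mean of x after each simulated observation (precision-weighted average).
    post_mean_x = (mu[x] / var[x] + w / var_w) / (1.0 / var[x] + 1.0 / var_w)
    # Value of the implementation decision after the measurement: the best of the
    # unchanged other means and the updated mean of x.
    best_other = np.max(np.delete(mu, x))
    value_after = np.maximum(best_other, post_mean_x)
    return value_after.mean() - np.max(mu)

mu = np.array([3.0, 4.0, 5.0, 5.0, 5.0])         # prior means (hypothetical)
var = np.array([64.0, 64.0, 64.0, 81.0, 100.0])  # prior variances (hypothetical)
for x in range(len(mu)):
    print(x + 1, round(kg_by_simulation(mu, var, var_w=1.0, x=x), 2))
```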

Page 5

(c) 2013 W.B. Powell 5

The knowledge gradient

Computing the knowledge gradient
» Notation:

$\beta^n_x$ = Precision (inverse variance) of our estimate of the value of $x$
$\beta^W$ = Precision of the measurement noise ($= 1/\sigma^2_W$)
$W^{n+1}_x$ = Measurement of $x$ in iteration $n+1$ (unknown at $n$)

» We update the precision using

$\beta^{n+1}_x = \beta^n_x + \beta^W$

» In terms of the variance, this is the same as

$\sigma^{2,n+1}_x = \left[\left(\sigma^{2,n}_x\right)^{-1} + \left(\sigma^2_W\right)^{-1}\right]^{-1} = \frac{\sigma^{2,n}_x}{1 + \sigma^{2,n}_x/\sigma^2_W}$

Page 6

(c) 2013 W.B. Powell 6

The knowledge gradient

Computing the knowledge gradient
» The change in variance can be found to be

$\tilde\sigma^{2,n}_x = \mathrm{Var}\!\left[\bar\mu^{n+1}_x \mid S^n\right] = \sigma^{2,n}_x - \sigma^{2,n+1}_x = \frac{\sigma^{2,n}_x}{1 + \sigma^2_W/\sigma^{2,n}_x}$

» Next compute the normalized influence:

$\zeta^n_x = -\left|\frac{\bar\mu^n_x - \max_{x'\neq x}\bar\mu^n_{x'}}{\tilde\sigma^n_x}\right|$

» Let

$f(\zeta) = \zeta\,\Phi(\zeta) + \phi(\zeta)$, where $\Phi(\zeta)$ is the cumulative standard normal distribution and $\phi(\zeta)$ is the standard normal density.

» The knowledge gradient is computed using

$\nu^{KG,n}_x = \tilde\sigma^n_x\, f(\zeta^n_x)$
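The formulas above translate directly into code. Below is a minimal pure-Python sketch (the function and variable names are mine, not from the course materials); it assumes independent normal beliefs and a known noise standard deviation $\sigma_W$.

```python
from math import erf, exp, pi, sqrt

def norm_pdf(z):
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def norm_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def knowledge_gradient(mu, sigma, sigma_w):
    """Return nu^KG_x for each alternative, given posterior means mu,
    posterior standard deviations sigma, and noise standard deviation sigma_w."""
    nu = []
    for x in range(len(mu)):
        # Change in variance from one measurement of x: sigma_tilde^2 = sigma^2,n - sigma^2,n+1
        var_next = 1.0 / (1.0 / sigma[x] ** 2 + 1.0 / sigma_w ** 2)
        sigma_tilde = sqrt(sigma[x] ** 2 - var_next)
        # Normalized distance to the best of the other alternatives
        best_other = max(mu[i] for i in range(len(mu)) if i != x)
        zeta = -abs(mu[x] - best_other) / sigma_tilde
        f = zeta * norm_cdf(zeta) + norm_pdf(zeta)
        nu.append(sigma_tilde * f)
    return nu
```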

Page 7

(c) 2013 W.B. Powell 7

The knowledge gradient

Computing the knowledge gradient

[Figure: five alternatives. The normalized distance from each estimate $\bar\mu^n_x$ to the best (or, for the current best, the second best) alternative is

$\zeta^n_x = -\left|\frac{\bar\mu^n_x - \max_{x'\neq x}\bar\mu^n_{x'}}{\tilde\sigma^n_x}\right|$

and the quantity illustrated for alternative 5 is $\nu^{KG}_5$.]

Page 8

(c) 2013 W.B. Powell

The knowledge gradient

KG calculations illustrated
» You may click on the spreadsheet to see the calculations.

Decision  mu^n  beta^n   beta^{n+1}  sigma~   max_x'  zeta     f(zeta)  nu^KG_x
1         3.0   0.0156   1.0156      7.9382   5       -0.2519  0.2856   2.2669
2         4.0   0.0156   1.0156      7.9382   5       -0.1260  0.3391   2.6920
3         5.0   0.0156   1.0156      7.9382   5        0.0000  0.3989   3.1669
4         5.0   0.0123   1.0123      8.9450   5        0.0000  0.3989   3.5685
5         5.0   0.0100   1.0100      9.9504   5        0.0000  0.3989   3.9696
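For readers without the spreadsheet, this short script (my own sketch) reproduces the rows above. It assumes $\beta^W = 1$, i.e. $\sigma_W = 1$, which is consistent with $\beta^{n+1} = \beta^n + 1$ in the table, and prior standard deviations of 8, 8, 8, 9 and 10 (so $\beta^n$ = 0.0156, 0.0156, 0.0156, 0.0123, 0.0100).

```python
from math import erf, exp, pi, sqrt

mu     = [3.0, 4.0, 5.0, 5.0, 5.0]
beta   = [1 / 64, 1 / 64, 1 / 64, 1 / 81, 1 / 100]   # prior precisions beta^n
beta_w = 1.0                                          # assumed measurement precision

for x in range(len(mu)):
    var_n, var_next = 1.0 / beta[x], 1.0 / (beta[x] + beta_w)
    sigma_tilde = sqrt(var_n - var_next)
    best_other = max(mu[i] for i in range(len(mu)) if i != x)
    zeta = -abs(mu[x] - best_other) / sigma_tilde
    f = zeta * 0.5 * (1 + erf(zeta / sqrt(2))) + exp(-0.5 * zeta ** 2) / sqrt(2 * pi)
    print(x + 1, round(sigma_tilde, 4), round(zeta, 4), round(f, 4), round(sigma_tilde * f, 4))
```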

Page 9

(c) 2013 W.B. Powell 9

The knowledge gradient

[Bar chart: $\bar\mu$, $\sigma$, and the KG index for choices 1–5.]

Page 10

(c) 2013 W.B. Powell 10

The knowledge gradient

[Bar chart: $\bar\mu$, $\sigma$, and the KG index for choices 1–5.]

Page 11

(c) 2013 W.B. Powell 11

The knowledge gradient

[Bar chart: $\bar\mu$, $\sigma$, and the KG index for choices 1–5.]

Page 12

(c) 2013 W.B. Powell 12

The knowledge gradient

The knowledge gradient policy

Properties
» Effectively a myopic policy, but also similar to steepest ascent for nonlinear programming.
» The best single measurement you can make (by construction).
» Asymptotically optimal (more difficult proof). As the measurement budget grows, we get the optimal solution.
» The knowledge gradient policy is the only stationary policy with this behavior.
• Many policies are asymptotically optimal (e.g. pure exploration, hybrid exploration/exploitation, epsilon-greedy), but are not myopically optimal.

$X^{KG}(S^n) = \arg\max_x \nu^{KG,n}_x$

Page 13

(c) 2013 W.B. Powell

The knowledge gradient policy

Myopic and asymptotic optimality

[Figure: performance vs. number of measurements, relative to the optimal solution. Curves labeled “Asymptotically optimal,” “Fast initial convergence, but stalls,” and “Ideal.”]

Page 14

(c) 2013 W.B. Powell

The knowledge gradient policy

Myopic and asymptotic optimality

[Figure: the knowledge gradient combines myopic optimality (fast initial convergence) with asymptotic optimality, tracking the ideal curve toward the optimal solution.]

Page 15

The knowledge gradient policy

KG versus Gittins indices for multiarmed bandit problems
» Gittins indices are provably optimal, but computing them is hard.
» Computed using the Chick and Gans (2009) approximation.

[Histograms: improvement of KG over Gittins, with an informative prior and with an uninformative prior.]

(c) 2013 W.B. Powell

Page 16

Knowledge gradient for online learning

But knowledge gradient can also handle:
» Finite horizons
» Correlated beliefs

[Four panels comparing opportunity cost: KG vs. Gittins, KG vs. interval estimation, KG vs. upper confidence bounding, KG vs. pure exploitation.]

(c) 2013 W.B. Powell

Page 17

Knowledge gradient for online learning

KG versus interval estimation
» Recall that with IE, you choose the alternative with the highest

$\nu^{IE}_x = \bar\mu_x + z_\alpha\,\bar\sigma_x$

[Plot: opportunity cost vs. the IE parameter $z_\alpha$ for IE and KG, showing the region where IE beats KG.]

(c) 2013 W.B. Powell

Page 18

(c) 2013 W.B. Powell

Outline

The knowledge-gradient policy
The S-curve effect
Online vs. offline problems
Learning a continuous function
KG with parametric beliefs for drug discovery

Page 19

(c) 2013 W.B. Powell

The nonconcavity of information

We can calculate the value of $n_x$ measurements.
» Updating the precision would now be

$\beta^{n}_x = \beta^0_x + n_x\,\beta^W$

» Compute $\tilde\sigma^{2,0}_x(n_x)$ using

$\tilde\sigma^{2,0}_x(n_x) = \mathrm{Var}\!\left[\bar\mu^{n}_x \mid S^0\right] = \sigma^{2,0}_x - \sigma^{2,n}_x = \frac{\sigma^{2,0}_x}{1 + \sigma^2_W/(n_x\,\sigma^{2,0}_x)}$

» Calculation of the knowledge gradient is the same:

$\nu^{KG}_x(n_x) = \tilde\sigma^{0}_x(n_x)\, f(\zeta^0_x)$ = Value of $n_x$ measurements, where $\zeta^0_x = -\left|\dfrac{\bar\mu^0_x - \max_{x'\neq x}\bar\mu^0_{x'}}{\tilde\sigma^0_x(n_x)}\right|$

Page 20

(c) 2013 W.B. Powell

The nonconcavity of information

The value of information is often concave…

Page 21

(c) 2013 W.B. Powell

The nonconcavity of information

… but not always.
» The marginal value of a single measurement can be small!

Page 22

(c) 2013 W.B. Powell

The nonconcavity of information

What influences the shape?
» Consider a baseball hitter whose “true” batting average is 0.300.
• The variance of a single at-bat is .3 x .7 = .21, so the standard deviation is about .46.
• What if the difference in my belief in the batting averages of two players is .02 (e.g. I think one bats .300 and the other bats .280)?
• Assume the standard deviation of my belief about the batting average is 50 points (.05).
» Click here to bring up the spreadsheet.
» Notes (a sketch reproducing these numbers follows below):
• The expected value of 10 at-bats in terms of increasing the expected batting average is .00087.
• The expected value of 100 at-bats in terms of increasing the expected batting average is .0068.
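A sketch of the calculation behind these notes, using the $\nu^{KG}_x(n_x)$ formula from the previous slide with the inputs stated above ($\sigma^{2,0} = 0.05^2$, $\sigma^2_W = 0.3 \times 0.7$, and a 0.02 gap to the competitor); the spreadsheet itself is not reproduced here.

```python
from math import erf, exp, pi, sqrt

var0  = 0.05 ** 2     # prior variance of my belief about the batting average
var_w = 0.3 * 0.7     # variance of a single at-bat (Bernoulli with p = 0.3)
delta = 0.02          # gap between my beliefs about the two players

def value_of_n_at_bats(n):
    var_n = 1.0 / (1.0 / var0 + n / var_w)   # posterior variance after n at-bats
    sigma_tilde = sqrt(var0 - var_n)         # sigma_tilde(n)
    zeta = -delta / sigma_tilde
    f = zeta * 0.5 * (1 + erf(zeta / sqrt(2))) + exp(-0.5 * zeta ** 2) / sqrt(2 * pi)
    return sigma_tilde * f

print(round(value_of_n_at_bats(10), 5))   # about .00087
print(round(value_of_n_at_bats(100), 4))  # about .0068
```

The average value per at-bat, value_of_n_at_bats(n)/n, first rises and then falls as n grows, which is exactly the S-curve (nonconcave value of information) discussed on the following slides.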

Page 23

(c) 2013 W.B. Powell

The nonconcavity of information

Optimal number of choices

As measurement noise increases, the optimal number of alternatives to evaluate decreases.

[Figure: value of information vs. number of alternatives being evaluated, for increasing levels of measurement noise.]

Page 24

(c) 2013 W.B. Powell

The nonconcavity of information

Examples of problems with non-concave information
» Finding the best hitters for a baseball team
» Finding the best stock pickers for an investment fund
» Finding the best high-value, low-volume products to put in inventory

Implications?
» Compare to the behavior we learned at the beginning of the course when we had an “S-curve” problem.

Page 25

(c) 2013 W.B. Powell

The nonconcavity of information

The KG(*) policy
» Maximize the average value per measurement, $\max_{n_x} \nu^{KG}_x(n_x)/n_x$.

Page 26

(c) 2013 W.B. Powell

Outline

The knowledge-gradient policy
The S-curve effect
Online vs. offline problems
Learning a continuous function
KG with parametric beliefs for drug discovery

Page 27

(c) 2013 W.B. Powell 27

Online vs. offline learning problems

Types of learning problems
» On-line learning
• Learn as you earn
• Examples:
– Finding the best path to work
– What is the best set of energy-saving technologies to use for your building?
– What is the best medication to control your diabetes?
• As you collect information, you collect rewards (or lose money). Collecting information is coincident with using the information.
• You have to balance the value of what you earn with a choice now against the benefit of the information you will gain for future decisions.

Page 28

(c) 2013 W.B. Powell 28

Online vs. offline learning problems

Types of learning problems
» Off-line learning
• There is a phase of information collection with a finite (sometimes small) budget.
• You are allowed to make a series of measurements, after which you make an implementation decision.
• Examples:
– Finding the best drug compound through laboratory experiments
– Finding the best design of a manufacturing configuration or engineering design which is evaluated using an expensive simulation
– What is the best combination of designs for hydrogen production, storage and conversion?
• Off-line learning separates the process (and costs) of learning from the benefits of using the information that you have gained.

Page 29

(c) 2013 W.B. Powell 29

Online vs. offline learning problems

For problems with a finite number of alternatives
» On-line learning (learn as you earn)
• This is known in the literature as the multi-armed bandit problem, where you are trying to find the slot machine with the highest payoff.
» Off-line learning
• You have a budget for taking measurements. After your budget is exhausted, you have to make a final choice.
• This is known as the ranking and selection problem.

Page 30

Online vs. offline learning problems

Knowledge gradient policy
» For off-line problems:

$\nu^{KG,n}_x$ = Value of a measurement for a single decision

» For finite-horizon on-line problems:
• Assume we have made 3 measurements out of our budget of 20.
• What is the value of learning from one more measurement?
• $\nu^{KG,3}_x$ is the improvement in the 4th decision given what we know after the 3rd measurement. But we benefit from this decision 17 more times:

$\nu^{KG\text{-}OL,3}_x = \bar\mu^3_x + (20-3)\,\nu^{KG,3}_x = \bar\mu^3_x + 17\,\nu^{KG,3}_x$

• The more times we can use the information, the more we are willing to take a loss now for future benefits.

(c) 2013 W.B. Powell

Page 31

Online vs. offline learning problems

Knowledge gradient policy
» For finite-horizon on-line problems:

$\nu^{KG\text{-}OL,n}_x = \bar\mu^n_x + (N - n)\,\nu^{KG,n}_x$

» For infinite-horizon discounted problems:

$\nu^{KG\text{-}OL,n}_x = \bar\mu^n_x + \frac{\gamma}{1-\gamma}\,\nu^{KG,n}_x$

In both cases the second term is the value of information.

Compare to Gittins indices for bandit problems

$\nu^{Gittins,n}_x = \bar\mu^n_x + \Gamma\!\left(\frac{\sigma^n_x}{\sigma_W},\gamma\right)\sigma_W$

… and UCB

$\nu^{UCB,n}_x = \bar\mu^n_x + 4\sigma_W\sqrt{\frac{\log n}{N^n_x}}$

where $N^n_x$ is the number of times alternative $x$ has been measured. The bonus terms in the Gittins and UCB indices are the ones marked “???” on the slide: they do not have the same direct value-of-information interpretation.
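A small sketch of how these scores can be computed side by side. The KG values are taken from the earlier spreadsheet example, while the measurement counts, horizon and discount factor are toy values of my own; the Gittins approximation is omitted.

```python
import math

mu     = [3.0, 4.0, 5.0, 5.0, 5.0]                 # current estimates mu^n_x
nu_kg  = [2.2669, 2.6920, 3.1669, 3.5685, 3.9696]  # offline KG values (spreadsheet slide)
counts = [1, 1, 1, 1, 1]                           # N^n_x: times each alternative was measured
sigma_w, N, n, gamma = 1.0, 20, 3, 0.9

kg_finite   = [m + (N - n) * v for m, v in zip(mu, nu_kg)]              # finite-horizon online KG
kg_discount = [m + gamma / (1 - gamma) * v for m, v in zip(mu, nu_kg)]  # infinite-horizon discounted
ucb         = [m + 4 * sigma_w * math.sqrt(math.log(n) / c) for m, c in zip(mu, counts)]

for name, scores in [("KG finite", kg_finite), ("KG discounted", kg_discount), ("UCB", ucb)]:
    print(name, [round(s, 1) for s in scores])
```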

(c) 2013 W.B. Powell

Page 32

(c) 2013 W.B. Powell

Outline

The knowledge-gradient policy
The S-curve effect
Online vs. offline problems
Learning a continuous function
KG with parametric beliefs for drug discovery

Page 33

(c) 2013 W.B. Powell

Learning the maximum of a function

Choosing prices to maximize revenue

» Measuring a price of $80 tells us something about the response at $81.

» Initial solution
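This example depends on correlated beliefs: measuring revenue at one price updates the belief at nearby prices. A minimal sketch of that update for a multivariate normal belief follows; the covariance kernel and all numbers are illustrative assumptions, not values from the lecture.

```python
import numpy as np

prices = np.arange(70, 91)                    # candidate prices $70..$90
mu     = np.full(len(prices), 500.0)          # prior mean revenue at each price
# Assumed squared-exponential covariance: nearby prices have correlated revenues.
Sigma  = 50.0 ** 2 * np.exp(-((prices[:, None] - prices[None, :]) / 5.0) ** 2)
var_w  = 30.0 ** 2                            # measurement-noise variance

def measure(mu, Sigma, x, w, var_w):
    """Update the multivariate normal belief after observing revenue w at price index x."""
    gain = Sigma[:, x] / (Sigma[x, x] + var_w)   # how much each price's belief moves
    mu_new = mu + gain * (w - mu[x])
    Sigma_new = Sigma - np.outer(gain, Sigma[x, :])
    return mu_new, Sigma_new

x = 10                                         # index of the $80 price
mu1, Sigma1 = measure(mu, Sigma, x, w=620.0, var_w=var_w)
print(round(mu1[x], 1), round(mu1[x + 1], 1))  # the $81 estimate moves almost as much as $80
```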

Page 34

(c) 2013 W.B. Powell

Learning the maximum of a function

Choosing prices to maximize revenue

» Measuring a price of $80 tells us something about the response at $81.

» After three measurements (including endpoints)

Page 35

(c) 2013 W.B. Powell

Learning the maximum of a function

Choosing prices to maximize revenue

» Measuring a price of $80 tells us something about the response at $81.

» After four measurements

Page 36

(c) 2013 W.B. Powell

Learning the maximum of a function

Choosing prices to maximize revenue

» Measuring a price of $80 tells us something about the response at $81.

» After 10 measurements

Page 37

Parametric beliefs and drug discovery

Learning a concave function

» As the number of observations increases, the policy quickly evolves to pure exploitation.

(c) 2013 W.B. Powell

Page 38

Parametric beliefs and drug discovery

Insights
» Trying what appears to be best maximizes profits given what you know, but you may be wrong.

» Generally not a good idea to try ideas that genuinely look bad.

» Best to try ideas that are just off center. You learn more, and you may learn that profits are even higher with different strategies.

(c) 2013 W.B. Powell

Page 39

(c) 2013 W.B. Powell

OJ game 2009

Mom&Pop pricing (2009)

Page 40

(c) 2013 W.B. Powell

OJ game 2009

Performance of different teams

Page 41

(c) 2013 W.B. Powell

OJ game 2010

Challenge
» If you are underperforming:
• Are your prices right?

• Perhaps they are too high? Too low?

• What is your level of uncertainty?

Page 42

(c) 2013 W.B. Powell

Outline

The knowledge-gradient policy
The S-curve effect
Online vs. offline problems
Learning a continuous function
KG with parametric beliefs for drug discovery

Page 43

(c) 2013 W.B. Powell

Parametric beliefs and drug discovery

Biomedical research
» How do we find the best drug to cure cancer?

» There are millions of combinations, with laboratory budgets that cannot test everything.

» We need a method for sequencing experiments.

Page 44

(c) 2013 W.B. Powell

Parametric beliefs and drug discovery

Designing molecules

» X and Y are sites where we can hang substituents to change the behavior of the molecule

Page 45

(c) 2013 W.B. Powell

Parametric beliefs and drug discovery

We express our belief using a linear, additive QSAR model:

$Y^m = \theta_0 + \sum_{\text{sites } i}\ \sum_{\text{substituents } j} \theta_{ij}\, X^m_{ij}$

$X^m_{ij} = 1$ if substituent $j$ is at site $i$ of molecule $m$, 0 otherwise.
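A minimal sketch of how this belief model induces correlated beliefs across compounds (the sites, substituents and prior values are made-up placeholders, not the actual molecule): each compound is encoded by indicator features, and a normal prior on $\theta$ implies $\mathrm{Cov}(Y^m, Y^{m'}) = x_m^\top \Sigma_\theta x_{m'}$.

```python
import numpy as np
from itertools import product

sites = ["X", "Y"]                      # substituent sites (placeholders)
substituents = ["H", "F", "OH"]         # candidate substituents (placeholders)
features = [(i, j) for i in sites for j in substituents]   # one theta_ij per (site, substituent)

def encode(compound):
    """Indicator vector [1, X_ij, ...] for a compound given as {site: substituent}."""
    return np.array([1.0] + [1.0 if compound[i] == j else 0.0 for (i, j) in features])

compounds = [dict(zip(sites, combo)) for combo in product(substituents, repeat=len(sites))]
X = np.vstack([encode(c) for c in compounds])     # 9 compounds x 7 features

theta_mean = np.zeros(X.shape[1])                 # prior mean on theta (assumed)
theta_cov  = np.eye(X.shape[1])                   # prior covariance on theta (assumed)

mu_Y  = X @ theta_mean                            # prior mean value of each compound
cov_Y = X @ theta_cov @ X.T                       # induced covariance between compounds
print(cov_Y[0, 1])   # compounds sharing a substituent at a site have correlated beliefs
```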

Page 46

(c) 2013 W.B. Powell

If we sample points near the middle, we will have a difficult time estimating the function:

Parametric beliefs and drug discovery

Page 47

(c) 2013 W.B. Powell

Sampling near the endpoints produces more stable estimates. Now take this into higher dimensions.
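The point can be checked with the classical formula $\mathrm{Var}(\hat\theta_1) = \sigma^2 / \sum_i (x_i - \bar x)^2$ for the slope of a one-dimensional linear fit: pushing measurements toward the endpoints increases $\sum_i (x_i - \bar x)^2$ and so shrinks the variance of the estimate. A quick numerical check, with illustrative numbers only:

```python
import numpy as np

sigma2 = 1.0                                               # measurement-noise variance
designs = {
    "middle":    np.array([0.4, 0.45, 0.5, 0.55, 0.6]),    # samples bunched near the middle
    "endpoints": np.array([0.0, 0.0, 0.5, 1.0, 1.0]),      # samples pushed to the endpoints
}
for name, x in designs.items():
    var_slope = sigma2 / np.sum((x - x.mean()) ** 2)       # variance of the estimated slope
    print(name, round(var_slope, 1))                       # middle: 40.0, endpoints: 1.0
```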

Parametric beliefs and drug discovery

Page 48

(c) 2013 W.B. Powell

Parametric beliefs and drug discovery

Knowledge gradient versus pure exploration for 99 compounds

[Plot: performance under best possible vs. number of molecules tested (out of 99), comparing the knowledge gradient with pure exploration.]

Page 49

(c) 2013 W.B. Powell

Parametric beliefs and drug discovery

A more complex molecule:

» From this base molecule, we created problems with 10,000 compounds, and one with 87,120 compounds.

[Figure: base molecule with numbered substituent sites R1–R5 and a list of potential substituents.]

Page 50

(c) 2013 W.B. Powell

Parametric beliefs and drug discovery

Compact representation of the 10,000-compound problem
» Results from 15 sample paths

[Plot: performance under best possible vs. number of molecules tested.]

Page 51

(c) 2013 W.B. Powell

Parametric beliefs and drug discovery

Single sample path on the molecule with 87,120 combinations

[Plot: performance under best possible vs. number of molecules tested.]

Page 52

(c) 2013 W.B. Powell

Parametric beliefs and drug discovery

Representing beliefs using linear regression has many applications:
» How do we find the optimal price of a product sold on the internet?
» Which internet ad will generate the most ad clicks?
» How will a customer, described by a set of attributes, respond to a price for a contract?
» What parameter settings produce the best results from my business simulator?
» What are the best features that I should include in a laptop?